A review of Approximate Bayesian Computation methods via density estimation: inference for simulator-models
Clara Grazian and Yanan Fan∗

Article Type: Advanced Review
Abstract
This paper provides a review of Approximate Bayesian Computation (ABC) methods for carrying out Bayesian posterior inference, through the lens of density estimation. We describe several recent algorithms and make connections with traditional approaches. We show advantages and limitations of models based on parametric approaches, and we then draw attention to developments in machine learning, which we believe have the potential to make ABC scalable to higher dimensions and may be the future direction for research in this area.

∗ School of Mathematics and Statistics, UNSW Sydney, Kensington 2052, Australia

Introduction
The goal of statistical inference is to draw conclusions about properties of a population given finite observational data $y = (y_0^{(1)}, \dots, y_0^{(n)})$. This typically proceeds by first specifying a parametric statistical model for the data generating mechanism $p(y \,|\, \theta)$, where θ denotes the parameters of the parametric model. A likelihood function can then be specified based on the parametric form. Once the data have been observed, the formal Bayesian inferential framework allows us to combine the likelihood function with any prior information, so that inference can be carried out on the model parameters θ.

In practice, and particularly when one wants to define realistic models for modern applications, the parametric model can be difficult to specify, and the likelihood function may not always be available in closed form. Approximate Bayesian Computation (ABC) is a class of tools and algorithms which have been developed to perform Bayesian inference in the absence of a likelihood function. A defining feature of this class of algorithms is the existence of, and reliance on, a known data generating mechanism, so that for any value of θ we can obtain pseudo-observations using the same mechanism that generated the observed data; we call these simulator-based models, i.e. models which are specified only through the generative mechanism. This type of modelling has been proposed in several contexts; examples include astrophysics (simulating the formation of galaxies, Cameron and Pettitt (2012)), neuroscience (simulating neural circuits, Lueckmann et al. (2017)), econometrics (Gourieroux et al. (1993)), epidemiology (simulating the spread of bacterial infections, Luciani et al. (2009)), and ecology and genetics (simulating animal population dynamics, Beaumont (2010)).

More formally, suppose we have a set of n observed data points $y = (y_0^{(1)}, \dots, y_0^{(n)})$. The data-generating process is known, but the likelihood function is unavailable, either because it is too costly to evaluate or because it cannot be computed analytically. The latter can happen, for example, in likelihoods involving complex integrals, e.g. in population genetics, where there is an integration over coalescent trees (Cornuet et al., 2008), or in stochastic volatility models, where there is an integration over the whole observational period (Creel & Kristensen, 2015). Then, given a particular parameter value (which may be simulated from its prior distribution), we simulate a new set of pseudo-observations $y_{sim}$ of the same dimension,

$$y_{sim} \sim p(y \,|\, \theta),$$

where we use the same notation to denote the true but unknown parametric model. Here we no longer have the analytical form of the probability density function (pdf); instead, we are able to obtain pseudo-observations $y_{sim}$.

The likelihood function is approximated via simulations of the parameter and data pair $(\theta, y_{sim})$, instead of being analytically evaluated at y. Earlier works on ABC were inextricably intertwined with algorithms, from the rejection-ABC algorithms of Tavaré et al. (1997) and Pritchard et al. (1999), to MCMC-ABC algorithms (Marjoram et al. (2003) and Bortot et al. (2007)), to the more advanced use of sequential Monte Carlo algorithms (Sisson et al. (2007) and Beaumont (2010)). Sisson and Fan (2018) provide a review of these algorithmic approaches.

The mainstream approach in ABC is to compare the observed data with the simulated data, usually through the use of some discrepancy measure. If the two are within some small distance of each other, then the parameter value θ that generated the pseudo-observations is kept and constitutes a sample from the posterior distribution. In order to increase the efficiency of the algorithms, the discrepancy measure is applied to summary statistics η(y) of the data y instead of the full set of data. Therefore, the parameter is studied through a likelihood function conditional on the observed summary statistics, instead of the data (Wilkinson, 2013), and consequently the target likelihood function is

$$L(\theta) \approx p(\eta(y) \,|\, \theta), \qquad (1)$$

for some choice of summary statistics η(·). In doing so, there may be a loss of information unless the summary statistics are sufficient (which is unlikely, since traditional approaches to finding sufficient statistics require some knowledge of the likelihood function). Details on the selection of summary statistics and discrepancy measures, and related aspects of inference based on this approach, are summarised in Sisson et al. (2018b). Nevertheless, inference based on the above approach can be slow and highly inefficient, and it can become computationally intractable when the dimension of the parameter θ is large, as many more datasets will need to be simulated to obtain a good approximation.

In this work, we focus our review on inference based on approximations of the likelihood, that is, methods that attempt to directly approximate the likelihood L(θ). Such methods have the advantage that they can be considerably more efficient, because they no longer depend on minimising a discrepancy measure and, in some cases, do not even require the elicitation of summary statistics. We begin by reviewing approximation methods where some parametric form is assumed for the unknown likelihood in Section 2; in Section 3 we describe nonparametric approaches to approximating the likelihood function and make some connections with standard ABC approaches; in Section 4 we describe some recent developments using machine learning methods, which offer the potential of scaling up to higher dimensions; we provide some examples illustrating the methods described in this review in Section 5, and conclude with some discussion in Section 6. In what follows, we use the simplified notation η(y) = η and η(y_sim) = η_sim.
The synthetic likelihood

If the model for the data or summary statistics can be considered reasonably regular (for instance, if the summary statistic is some type of sample average and the sample size is large), then it may be reasonable to approximate the distribution of the summary statistics by a Gaussian:

$$p(\eta \,|\, \theta) = (2\pi)^{-p/2} |\Sigma_\theta|^{-1/2} \exp\left\{ -\frac{1}{2} (\eta - \mu_\theta)^T \Sigma_\theta^{-1} (\eta - \mu_\theta) \right\}, \qquad (2)$$

where the expected value µ_θ and the variance Σ_θ are, in general, unknown and p is the dimension of the summary statistics. Equation (2) represents an approximation of the likelihood (1), unless the distribution of the summary statistics is indeed Gaussian; the goodness of the approximation depends on the validity of the asymptotic normality assumption. The parameters of the Gaussian are approximated by averages over a set of simulated summary statistics:

$$\hat\mu_\theta = \frac{1}{N} \sum_{j=1}^N \eta_{sim}^{(j)}, \qquad \hat\Sigma_\theta = \frac{1}{N-1} \sum_{j=1}^N (\eta_{sim}^{(j)} - \hat\mu_\theta)(\eta_{sim}^{(j)} - \hat\mu_\theta)^T,$$

where N is the total number of datasets simulated at θ. Thus, N is a trade-off parameter: as it increases, the approximation becomes more accurate, at the expense of an increased computational burden. This approach produces an approximation of the likelihood, $\hat L_s$, termed the synthetic likelihood by Wood (2010).

Inference for θ can be obtained by directly maximising the synthetic likelihood, or by deriving the posterior distribution in a standard Bayesian context, using the approximate likelihood of the summary statistics as the likelihood and a standard algorithm, such as the Metropolis-Hastings algorithm, to traverse the space of θ (Meeds & Welling, 2014). Note that the approximated Gaussian distribution $N(\hat\mu_\theta, \hat\Sigma_\theta)$ is not an unbiased estimate of the synthetic likelihood $N(\mu_\theta, \Sigma_\theta)$. Price et al. (2018) analyse the use of the unbiased estimator proposed by Ghurye and Olkin (1969), for which the posterior distribution is of the form

$$\pi(\theta \,|\, \mu_\theta, \Sigma_\theta) \propto N(\eta; \mu_\theta, \Sigma_\theta)\, \pi(\theta).$$
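As an illustration, the synthetic likelihood estimate at a single value of θ can be sketched as follows. This is a minimal sketch, not the reference implementation of Wood (2010); the simulator, summary statistic and parameter values in the toy check are illustrative assumptions.

```python
import numpy as np
from scipy.stats import multivariate_normal

def synthetic_loglik(theta, simulator, summary, eta_obs, N=200, rng=None):
    """Gaussian synthetic log-likelihood at a single theta.

    simulator(theta, rng) returns one pseudo-dataset; summary(y) returns the
    vector of summary statistics. N controls the accuracy/cost trade-off.
    """
    rng = rng or np.random.default_rng()
    # Simulate N datasets at theta and reduce each to its summary statistics.
    etas = np.array([summary(simulator(theta, rng)) for _ in range(N)])
    mu_hat = etas.mean(axis=0)                      # estimate of mu_theta
    sigma_hat = np.atleast_2d(np.cov(etas, rowvar=False))  # 1/(N-1) normalisation
    return multivariate_normal.logpdf(eta_obs, mean=mu_hat, cov=sigma_hat)

# Toy check (hypothetical model): Gaussian location, summary = (mean, sd).
rng = np.random.default_rng(1)
sim = lambda th, r: r.normal(th, 1.0, size=100)
summ = lambda y: np.array([y.mean(), y.std(ddof=1)])
eta_obs = summ(sim(0.0, rng))                       # "observed" summaries at theta = 0
ll0 = synthetic_loglik(0.0, sim, summ, eta_obs, rng=rng)
ll5 = synthetic_loglik(5.0, sim, summ, eta_obs, rng=rng)
assert ll0 > ll5  # theta near the truth should receive a higher synthetic likelihood
```

The same function can be plugged into a Metropolis-Hastings sampler by treating `synthetic_loglik` as the log-likelihood term in the acceptance ratio.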
Under the assumption that the summary statistics are Gaussian distributed, the synthetic likelihood allows us to target the posterior distribution π(θ | µ_θ, Σ_θ), but not directly π(θ | η), particularly if a transformation of the summary statistics is used to encourage normality. The use of the synthetic likelihood does not require the user to define a discrepancy function between summary statistics, since one is implicitly induced by the use of a multivariate Gaussian distribution. The quality of the approximation depends on how close the distribution of the summary statistics is to a Gaussian distribution; in particular, for small sample sizes the approximation may be unreliable, see Example 5.2. However, empirical work shows that the synthetic likelihood can be robust to violations of normality, as shown in Everitt (2017) and Price et al. (2018). Recently, a more robust semiparametric version has been proposed to relax the normality assumptions (An et al., 2018).

As described above, the synthetic likelihood can be used as a surrogate for the likelihood in a Bayesian approach based on standard MCMC algorithms, or it can be integrated within an ABC framework (Meeds & Welling, 2014) by convolving a kernel, measuring the discrepancy between observed and simulated datasets, with the Gaussian synthetic likelihood:

$$L_s(\theta) = \int K_\varepsilon(\eta, \eta_{sim})\, N(\eta_{sim}; \hat\mu_\theta, \hat\Sigma_\theta)\, d\eta_{sim}.$$

Here the simulated datasets/summary statistics are assumed to be Gaussian distributed; however, it is possible to introduce an additional error term allowing for a different model for the observed summary statistics. If a Gaussian kernel is used,

$$K_\varepsilon(\eta, \eta_{sim}) = (2\pi\varepsilon)^{-p/2} \exp\left\{ -\frac{1}{2\varepsilon} (\eta_{sim} - \eta)^T (\eta_{sim} - \eta) \right\},$$

then the approximated likelihood function $\hat L_s$ is again Gaussian, $L_s(\theta) = N(\eta; \hat\mu_\theta, \hat\Sigma_\theta + \varepsilon I)$, where I is the identity matrix. By letting ε → 0, the bias introduced by using simulated datasets can be reduced. In the latter procedure, the likelihood function depends on a kernel which is a function of the difference between summary statistics, and it is more robust for irregular models, such as those in chaotic dynamic systems, where the Gaussian assumption can be seen as too strong. An alternative approach would be to estimate the covariance matrix in a robust way, as proposed in Wood (2010).

Under this synthetic likelihood approach, the step of simulating new datasets can be particularly expensive; various approaches to work more efficiently with the synthetic likelihood have been proposed; see, for example, Meeds and Welling (2014).
Nonparametric approaches

The vanilla Approximate Bayesian Computation algorithm uses a discrepancy measure to compute the distance between summary statistics of simulated and observed datasets. Algorithm 1 below describes a standard rejection sampling ABC algorithm; see also Pritchard et al. (1999).

Require: 1. Observed data y; 2. Prior distribution π(θ); 3. Generative model p(y | θ); 4. Discrepancy measure ∆; 5. Tolerance value ε.
1. Simulate θ* ∼ π(θ).
2. Simulate y_sim ∼ p(y | θ*).
   if y_sim = y (discrete data) or ∆(η, η_sim) < ε (continuous data) then θ* forms part of the posterior sample; else discard θ*. end if
3. Repeat Steps 1 and 2 until enough samples of θ are obtained.

Algorithm 1: Rejection-ABC. The discrepancy measure ∆ is usually taken to be the Euclidean distance; the tolerance level ε should be set close to 0 (the particular choice is problem-specific).

In this basic version of the algorithm, when the data are discrete, it is possible to consider matching the simulated data with the observed data perfectly; however, this is not possible when the data are continuous, and it is also highly inefficient as the sample size increases. Therefore, in practice, matching within a small distance ε > 0 is used. If ε is too large, posterior estimates of θ can be biased, and posterior credible intervals will be too large. As ε decreases, the computational cost increases dramatically, so there is a trade-off between accuracy and computational capacity. The use of low-dimensional summary statistics is necessary in most applications, despite the fact that their use introduces an additional layer of complexity and potential loss of information from the data. However, well chosen summary statistics can enhance inference (Fan et al., 2013).

The most used modification to improve the computational efficiency of Algorithm 1 is the so-called regression adjustment (Beaumont et al., 2002). The main idea of regression ABC is to run standard ABC with a relatively large threshold ε and then adjust the obtained samples through a regression

$$\theta_j = f(\eta_j) + \gamma_j, \qquad j = 1, \dots, J,$$

where J is the total number of values accepted after running Algorithm 1 and γ_j is an error term centred around zero. The resulting values are then closer to samples from the posterior distribution, as the regression predicts θ at ε = 0. It can be proved that the empirical variance of the adjusted sample is smaller than the empirical variance of the non-adjusted values (Blum, 2010).
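The rejection step of Algorithm 1, for continuous data, can be sketched in a few lines. This is a minimal sketch; the toy Gaussian simulator, the Uniform prior and all numerical settings are illustrative assumptions, not part of the original algorithm description.

```python
import numpy as np

def rejection_abc(eta_obs, simulate, summary, prior_sample, eps, n_keep, rng=None):
    """Rejection-ABC (Algorithm 1): keep prior draws whose simulated summaries
    fall within Euclidean distance eps of the observed summaries."""
    rng = rng or np.random.default_rng()
    kept = []
    while len(kept) < n_keep:
        theta = prior_sample(rng)                    # step 1: draw from the prior
        eta_sim = summary(simulate(theta, rng))      # step 2: simulate pseudo-data
        if np.linalg.norm(eta_sim - eta_obs) < eps:  # accept if within tolerance
            kept.append(theta)
    return np.array(kept)

# Toy run: Gaussian mean with a Uniform(-5, 5) prior, summary = sample mean.
rng = np.random.default_rng(2)
simulate = lambda th, r: r.normal(th, 1.0, size=50)
summary = lambda y: np.atleast_1d(y.mean())
eta_obs = summary(simulate(1.0, rng))                # pseudo-observed data at theta = 1
post = rejection_abc(eta_obs, simulate, summary,
                     lambda r: r.uniform(-5, 5), eps=0.2, n_keep=200, rng=rng)
assert abs(post.mean() - 1.0) < 0.5                  # posterior mass near the true value
```

Note that the fraction of accepted draws at a fixed θ is exactly the empirical probability that the discrepancy falls below ε, which is the quantity used later in this section to approximate the likelihood.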
More recently, Li and Fearnhead (2016) proved that, for an appropriate choice of the bandwidth in ABC, standard ABC and regression-adjusted methods lead to an approximate posterior distribution which, asymptotically, correctly quantifies the uncertainty; in particular, the threshold ε is required to depend on the sample size n in order to achieve this result.

Even with adjustments to the output of standard ABC, most parameter values can produce large distances between summary statistics when simulation is performed from an uninformative prior distribution, so a large number of simulated datasets is needed to identify the area of the parameter space closest to the true parameter value, and this is computationally costly. An alternative approach would be to model the discrepancy measure. Gutmann and Corander (2016) showed that the likelihood function can be approximated by

$$p(\eta \,|\, \theta) \approx E[K(\eta, \eta_{sim})],$$

where K(·) is a kernel function evaluating the distance between observed and simulated summary statistics. This expected value can then be approximated by

$$\hat L_K(\theta) = \frac{1}{N} \sum_{j=1}^N K(\eta, \eta_{sim}^{(j)}),$$

that is, we can repeatedly sample η_sim at a given value of θ and use a kernel density estimator to approximate the likelihood at θ. The connection between traditional ABC algorithms and kernel density estimation was also explored in Sisson et al. (2018b), and in Blum (2010) for an earlier reference.

Alternatively, the approximate likelihood function $\hat L_K$ can be written as a function of the discrepancy ∆ in Algorithm 1. The most used kernel is the Uniform kernel, for which $K(\Delta) = c\, I_{[0,\varepsilon)}(\Delta)$, where $I_{[a,b)}(x)$ is an indicator function equal to one if x ∈ [a, b) and c is some constant. Then the likelihood function can be approximated by

$$\hat L_K(\theta) = \widehat{\Pr}(\Delta(\eta, \eta_{sim}) \le \varepsilon),$$

i.e. the empirical probability that the discrepancy measure is smaller than a threshold ε. Gutmann and Corander (2016) showed that maximising the synthetic likelihood (Section 2) corresponds to maximising a lower bound of a nonparametric approximation of the likelihood L_K(θ).

Gutmann and Corander (2016) model the discrepancy ∆(η, η_sim) as a Gaussian process, using a squared exponential covariance function and uncorrelated Gaussian noise, within the framework of a Bayesian optimization algorithm (Williams & Rasmussen, 2006). The resulting algorithm is called Bayesian optimization for likelihood-free inference (BOLFI); for an application in population genetics see Numminen et al. (2016). Since the discrepancy is a positive function, it is possible to consider a transformation g(·): R⁺ → R which is more likely to follow a Gaussian distribution, for example the logarithm, so that one works with Pr(g[∆(η, η_sim)] ≤ ε′). However, the Gaussian assumption may not hold in general, in particular because the variance of the discrepancy is likely to vary over the parameter space; the Gaussian process model for the discrepancy may therefore affect the accuracy of the ABC posterior estimation when some of these assumptions are not met.

Järvenpää et al. (2018) proposed three methods using a Gaussian process to model the discrepancy measure:

• The discrepancy measure ∆ can be assumed to follow a Gaussian distribution, $\Delta(\eta, \eta_{sim}) \sim N(f(\theta), \sigma^2)$, i.e. with constant variance. The mean is modelled as a Gaussian process with mean m(θ) and covariance structure k(θ, θ′) given by some function of the distance between values, for example a squared exponential function. In this setting,

$$\Pr(\Delta(\eta, \eta_{sim}) \le \varepsilon) = \Phi\left(\frac{\varepsilon - \mu(\theta)}{\sqrt{v(\theta) + \sigma^2}}\right),$$

where Φ(·) is the cumulative distribution function of a standard Gaussian variable and µ(θ) and v(θ) are the posterior mean and variance of the function f(θ).
• The variance of the discrepancy can be allowed to vary over the parameter space, so that $\Delta(\eta, \eta_{sim}) \sim N(f(\theta), \sigma^2 \exp(g(\theta)))$. In addition to a prior model for the mean function f(θ) (which can again be assumed to be a Gaussian process), the variance, or a function of it such as g(θ), also needs to be modelled; if it is reasonable to assume that it changes smoothly as a function of θ, a Gaussian process can again be imposed as a prior distribution. In this setting,

$$\Pr(\Delta(\eta, \eta_{sim}) \le \varepsilon) = \Phi\left(\frac{\varepsilon - \mu(\theta)}{\sqrt{v(\theta) + \sigma^2 \exp(g(\theta))}}\right).$$

• Instead of directly modelling the discrepancy measure, it is possible to associate it with a latent variable Z, such that

$$Z = 2\, I\{\Delta(\eta, \eta_{sim}) \le \varepsilon\} - 1 \in \{-1, +1\},$$

following a probit or a logit model: Pr(z | f(θ)) = g(z, f(θ)), where the function f(θ) ∼ N(m(θ), k(θ, θ′)). The difference between this version of the algorithm and the two previous ones is that here the discriminative function is modelled as a smooth function, but no assumption is made on the form of the distribution of the discrepancy measure.

An alternative would be to consider a Student's t distribution, as in Shah et al. (2014). The accuracy of the estimation depends on how well a Gaussian distribution can model the distribution of the discrepancy measure: if the discrepancy is roughly Gaussian distributed, the standard GP or its heteroskedastic version can be a good representation, while, as the discrepancy moves away from normality (with multimodality or heavy tails), the classification approach, the Student's t distribution or other algorithms such as the ones presented in Section 4 can lead to better estimates. Since the goodness of fit of each of these models depends on the specific problem at hand, it is possible to compare several models through model selection techniques as part of the analysis. Järvenpää et al. (2018) propose two utility functions along the lines of Vehtari and Ojanen (2012): the mean of the log-predictive density, which measures how well the Gaussian process predicts the distribution of the discrepancies, and the classifier utility, which penalises realisations of the discrepancy that are under the threshold.

The approaches based on Gaussian process priors for the discrepancy measure differ from Wilkinson (2014), where the likelihood is directly modelled as a Gaussian process, and from Meeds and Welling (2014), where each element of the intractable mean and covariance matrix is modelled as a Gaussian process. Clearly, the choice of the discrepancy is essential here, as it affects the goodness of fit of the Gaussian process and, therefore, the quality of the approximation; the standard choice is the Euclidean distance.

A nonparametric approach (based on the definition of a discrepancy measure between summary statistics) can be more accurate than a parametric approach, as in Section 2, when the summary statistic η(·) is low-dimensional; however, as the number of summary statistics increases, the accuracy of the algorithms tends to deteriorate. The methods just described use either kernel density estimates or Gaussian processes to model the distribution of the discrepancy measure, in order to reduce the assumptions on the distribution of the summary statistics. A completely nonparametric alternative within the ABC framework makes use of an empirical likelihood approach (Owen, 1988; Mengersen et al., 2013). The empirical likelihood is a nonparametric estimator of the likelihood function. Given a set of independent and identically distributed observations y_i, i = 1, ..., n from a distribution F, the empirical likelihood function is defined as

$$L_{el}(\theta) = \max_{p_i} \prod_{i=1}^n p_i,$$

where the weights p_i are obtained under the constraints $\sum_{i=1}^n p_i = 1$, $0 < p_i < 1$ and $\sum_{i=1}^n p_i h(y_i, \theta) = 0$; here $\sum_{i=1}^n p_i h(y_i, \theta) = 0$ is a moment condition. Extensions to non-i.i.d. settings are available in Owen (2001) and Schennach (2005); Grendár and Judge (2009) provide a Bayesian justification of this procedure. The empirical likelihood approach defines a set of weights for the values of the parameter of interest. By combining simulations from the prior distribution with the weights defined through the empirical likelihood, it is possible to obtain an approximate sample from the posterior distribution. With respect to standard ABC methods, this approach avoids the definition of summary statistics, whose relationship with the parameters of the model is, in general, unknown. The method can be applied in settings where model misspecification might lead to strongly biased estimates (Grazian & Liseo, 2017). However, the definition of unbiased estimating equations h(·) is not always straightforward.

Approaches based on machine learning

As has been discussed, standard ABC algorithms often rely on the definition of a similarity threshold ε and on the approximation of the posterior distribution π(θ | y) with a distribution conditional on the similarity between datasets, π(θ | ∆(η, η_sim) ≤ ε). However, this approach presents some drawbacks, since the accuracy increases only when ε →
0, at the expense of an increased computational cost. The lack of scalability to higher dimensions is well recognised in the ABC literature (Sisson et al., 2018a); this drives more recent research towards scalable and more user-friendly techniques.

An active stream of current research focuses on treating the likelihood in Equation (1) as a regression density estimation problem, using simulations of both the parameters and the summary statistics to train the conditional density. The resulting estimated density is then an analytically tractable approximation of the likelihood function, which is used for a Bayesian analysis. Note that directly approximating the likelihood involves an additional step: computing the posterior, or maximising the likelihood to obtain the MLE for θ. Such approaches may be preferable in problems where, for example, inference is required for multiple datasets arising from the same model.

Fan et al. (2013) describe a flexible conditional density estimation approach where the approximation is constructed from a sample of N summary statistic and parameter pairs $(\eta_1, \theta_1), \dots, (\eta_N, \theta_N)$ drawn from a distribution p(η | θ)h(θ). Note that, unlike standard ABC algorithms such as the rejection-ABC described in Algorithm 1, while the summary statistics are generated given θ from the sampling distribution of the intractable model of interest, the parameters are not necessarily generated from the prior. Instead, h(θ) is a distribution chosen to reflect the region in the vicinity of η, since interest is in the posterior distribution π(θ | η). Some rough knowledge of the high likelihood region of the parameter space is needed; Fan et al. (2013) suggest an initial pilot analysis for setting h(θ).

Since a good approximation of the conditional distribution is crucial in this approach, two issues in particular should be considered in designing the density estimator.
First, the relationship between η and θ can be complex, and careful selection of summary statistics may help simplify this relationship. Second, when the dimension of the summary statistic is large, density estimation of the joint distribution is difficult. Fan et al. (2013) advocated a two-step approach.

The first step is to build marginal regression models for each component of $\eta = (\eta_1, \dots, \eta_K)$ conditional on θ. A mixture of experts model was used for this purpose, coupled with a variational method for fitting the model based on the sampled data $(\eta_j, \theta_j)$, j = 1, ..., N, where the regression density estimator for each component η_k takes the form

$$f_k(\eta_k \,|\, \theta) = \sum_{m=1}^M w_{km}(\theta)\, N(\mu_{km}(\theta), \sigma^2_{km}(\theta)),$$

for some appropriately chosen value M, indicating the total number of mixture components to be used, and

$$w_{km}(\theta) = \frac{\exp(\xi^0_{km} + (\xi_{km})^T \theta)}{\sum_{m'=1}^M \exp(\xi^0_{km'} + (\xi_{km'})^T \theta)}, \quad \mu_{km}(\theta) = \beta^0_{km} + (\beta_{km})^T \theta, \quad \log(\sigma^2_{km}(\theta)) = \gamma^0_{km} + (\gamma_{km})^T \theta.$$

Then, a conditional density estimate for the joint distribution of η given θ is constructed. The data $(\eta_j, \theta_j)$ are transformed to $(U_j, \theta_j)$, where $U_{jk} = \Phi^{-1}(\hat F_k(\eta_{jk} \,|\, \theta_j))$ and $\hat F_k(\eta_k \,|\, \theta)$ is the distribution function corresponding to the density $\hat f_k(\eta_k \,|\, \theta)$. If the marginal densities for each η_k are well estimated, the transformation to U_j makes each component of U_j approximately standard Gaussian, regardless of the value of θ. A mixture of Gaussian distributions is then fitted to the data $(U_j, \theta_j)$, j = 1, ..., N.
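The transformation step above can be sketched numerically. For simplicity, this sketch fits a single linear-Gaussian regression (rather than a mixture of experts) for each estimated marginal; the data-generating model, the form of h(θ) and all constants are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)

# Hypothetical training pairs: theta drawn from a proposal h(theta),
# two summary statistics depending on theta.
N = 2000
theta = rng.uniform(-2, 2, size=N)
eta = np.column_stack([theta + 0.1 * rng.normal(size=N),
                       np.exp(0.5 * theta) + 0.1 * rng.normal(size=N)])

# Step 1 (simplified): fit a Gaussian regression marginal for each component,
# eta_k | theta ~ N(a_k + b_k * theta, s_k^2), via least squares.
U = np.empty_like(eta)
X = np.column_stack([np.ones(N), theta])
for k in range(eta.shape[1]):
    coef, *_ = np.linalg.lstsq(X, eta[:, k], rcond=None)
    resid = eta[:, k] - X @ coef
    s = resid.std(ddof=2)
    # Step 2: probability integral transform, U_k = Phi^{-1}(F_hat_k(eta_k | theta)).
    U[:, k] = norm.ppf(norm.cdf(eta[:, k], loc=X @ coef, scale=s))

# Where the marginal fit is good (component 0 is truly linear-Gaussian),
# U_k is approximately standard Gaussian regardless of theta.
assert abs(U[:, 0].mean()) < 0.1 and abs(U[:, 0].std() - 1.0) < 0.1
```

Component 1, whose mean is nonlinear in θ, is deliberately misfit by the linear marginal, illustrating why flexible marginal estimators (such as the mixture of experts) matter for the quality of the transformed data.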
The joint density of (U, θ) is a mixture of multivariate Gaussians taking the form

$$g(U, \theta) = \sum_{l=1}^L w_l\, N(\mu_l, \Psi_l),$$

with L Gaussian mixture components, where $w_l$, $\mu_l$ and $\Psi_l$ are respectively the weight, mean and covariance matrix of the l-th mixture component.

The conditional distribution of U | θ implied by the multivariate Gaussian mixture for the joint distribution is again a mixture of Gaussians; denote this new mixture by

$$g(U \,|\, \theta) = \sum_{l=1}^L w^c_l\, N(\mu^c_l, \Psi^c_l),$$

where $\mu^c_l$ and $\Psi^c_l$ are the conditional mean and covariance of U | θ in the l-th component of the multivariate $N(\mu_l, \Psi_l)$. The mixing weights for the conditional distribution are given by

$$w^c_l = \frac{w_l\, \phi(\theta; \mu_l, \Psi_l)}{\sum_{m=1}^L w_m\, \phi(\theta; \mu_m, \Psi_m)},$$

where $\phi(\theta; \mu_l, \Psi_l)$ denotes the multivariate Gaussian density in θ with mean and covariance implied by g(U, θ).

Finally, the approximate likelihood of η | θ is derived through back-transformation, via

$$\hat L(\eta \,|\, \theta) = \hat g(U \,|\, \theta) \prod_{k=1}^K \frac{\hat f_k(\eta_k \,|\, \theta)}{\phi(U_k; 0, 1)}.$$

The rationale behind the two-stage approach is that the marginal distributions can be estimated well without a huge amount of data, and this in turn improves the estimate of the joint distribution based on a moderate amount of data.

More recent developments have focussed on the use of conditional neural density estimators. In short, a neural density estimator is a parametric model $q_\phi$, for example a neural network parameterised by the weights φ. The function $q_\phi$ is trained by maximising the total log probability $\sum_{j=1}^N \log q_\phi(y_j \,|\, \theta_j)$ with respect to φ. Given sufficient training data and a sufficiently flexible model, $q_\phi(y \,|\, \theta)$ will approximate the conditional distribution p(y | θ). Papamakarios et al. (2019) proposed a sequential neural density estimator using conditional autoregressive flows (Papamakarios et al., 2017), aimed at a general-purpose solution, with an adaptive online scheme for the choice of a sampling distribution h(θ). Recall that the prior is generally too diffuse to be informative enough to sample θ in the region corresponding to y.

Clearly, any conditional density estimation method used to approximate the likelihood can also be used to approximate the posterior distribution p(θ | y) directly. However, direct approximation of the posterior distribution is complicated by the need to sample θ from h(θ) rather than the prior π(θ). A sequential neural posterior estimator was used, for example, in Papamakarios and Murray (2016), who used a Mixture Density Network (MDN), i.e. a mixture of K Gaussian distributions whose parameters are estimated by a neural network, while Lueckmann et al. (2018) and Lueckmann et al. (2017) used multi-layer neural networks. The sampling distribution h(θ) is obtained over several quick iterations of posterior density estimation, with the improved estimate of h(θ) being based on the previous estimate of the posterior distribution.

Finally, a correction is required to account for the fact that the samples of θ are not drawn from the prior. The conditional density function approximates

$$\tilde q_\phi(\theta \,|\, y) \propto h(\theta)\, \pi(y \,|\, \theta),$$

which is not the true posterior; to obtain an approximation of the posterior distribution, it is necessary to reweight it as

$$\hat\pi(\theta \,|\, y) \propto \frac{\pi(\theta)}{h(\theta)}\, \tilde q_\phi(\theta \,|\, y), \qquad (3)$$

where $\tilde q_\phi(\theta \,|\, y)$ can be a neural network parameterised by φ and trained on samples of θ and y. Since the reweighting in Equation (3) introduces extra variance into the estimation, Greenberg et al. (2019) proposed the automatic posterior transformation (APT) method.
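The prior correction in Equation (3) can be illustrated with a simple importance-weighting sketch. The densities below are stand-ins for a trained estimator $\tilde q_\phi$: with a flat likelihood, $\tilde q_\phi$ coincides with the proposal h(θ), so the reweighted samples should recover the prior; all distributions and constants are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)

# Stand-ins: prior pi(theta) = N(0, 2^2); proposal h(theta) = N(1, 3^2),
# chosen wider than the prior so the importance weights stay well behaved.
prior = norm(0.0, 2.0)
proposal = norm(1.0, 3.0)
samples = proposal.rvs(size=20000, random_state=rng)  # draws from q_tilde (here = h)

# Correction of Equation (3): weight each draw by pi(theta) / h(theta).
w = prior.pdf(samples) / proposal.pdf(samples)
w /= w.sum()

# With the flat-likelihood stand-in, the weighted draws target the prior,
# so the weighted mean should be close to the prior mean (0).
post_mean = np.sum(w * samples)
assert abs(post_mean - prior.mean()) < 0.2
```

The same ratio π(θ)/h(θ) appears whenever the training distribution over θ differs from the prior; APT avoids applying it as a post-hoc weight, which is what removes the extra variance.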
Let a proposal posterior be defined as

$$\hat q_\phi(\theta \,|\, y) \propto q_\phi(\theta \,|\, y)\, \frac{h(\theta)}{\pi(\theta)},$$

where $q_\phi(\theta \,|\, y)$ is an estimate of the true posterior π(θ | y), and training is carried out by minimising the objective function $-\sum_{j=1}^N \log \hat q_\phi(\theta_j \,|\, y_j)$ with respect to φ; the procedure recovers both the true posterior and the proposal posterior.

Other related works include Alsing et al. (2019), who use neural density estimators in the context of problems encountered in cosmology; Radev et al. (2019), who showed that a convolutional network in an ABC setting can be trained to directly obtain the posterior mean and variance; Bonassi et al. (2011), who used multivariate Gaussian mixture models for the density estimator in the context of statistical genetic models; and Izbicki et al. (2019), who provide a nonparametric density estimation method aimed at high-dimensional data.

Examples

We now compare some of the techniques presented in the previous sections on two different examples. First, we analyse data simulated from a Gaussian generative model; this is a benchmark, since in this case it is possible to choose a sufficient summary statistic and there is no loss of information in ABC procedures. The second example uses the Ricker model, which is well known in the ABC literature since its use in the work of Wood (2010). Both examples are based on simulations, and we repeated the simulations 250 times for each model in order to study the frequentist behaviour of the procedures analysed. All code used in this section is available at https://github.com/cgrazian/ABC review .

We first consider a simple setting of observations generated from a Gaussian distribution with known variance.
Therefore, the simulated data can be simply generated as

\[ y_{\mathrm{sim}} = \theta + \gamma, \qquad \gamma \sim N(0, I_n), \]

where I_n is the identity matrix of dimension n × n.

In this context, it is reasonable to choose the sample mean as summary statistic, η_0 = (1/n)∑_{i=1}^{n} y_0^{(i)} and η_sim = (1/n)∑_{i=1}^{n} y_sim^{(i)}, which is also sufficient for θ; therefore this is a case where ABC targets the posterior distribution conditional on the observations and not only on the summary statistics. In particular, the summary statistic follows a Gaussian distribution, η_sim ∼ N(θ, 1/n). The synthetic likelihood is

\[ \hat{L}_s(\theta) = \left(\frac{n}{2\pi}\right)^{1/2} \exp\left\{ -\frac{n}{2}\left(\eta_0 - \theta - g\right)^2 \right\} \tag{4} \]

where g is a random variable such that g ∼ N(0, 1/(nN)), and the estimator for θ again follows a Gaussian distribution: θ̂_s ∼ N(η_0, 1/(nN)).

On the other hand, when using a nonparametric approximation, it is again reasonable to use the sample mean as the summary statistic in Algorithm 1. Suppose that ∆(η_0, η_sim) = (η_0 − η_sim)². The probability that the discrepancy is lower than the threshold level ε approximates the likelihood function:

\[ \Pr\left(\Delta(\eta_0, \eta_{\mathrm{sim}}) \le \varepsilon\right) = \Pr\left((\eta_0 - \eta_{\mathrm{sim}})^2 \le \varepsilon\right) = \Pr\left(-\sqrt{\varepsilon} \le \eta_0 - \eta_{\mathrm{sim}} \le \sqrt{\varepsilon}\right) = \Phi\left(\sqrt{n}(\eta_0 - \theta) + \sqrt{n\varepsilon}\right) - \Phi\left(\sqrt{n}(\eta_0 - \theta) - \sqrt{n\varepsilon}\right), \]

where Φ(·) is the cumulative distribution function of a standard Gaussian variable. Therefore, when nε is small, the likelihood function of the approximate sample is

\[ \hat{L}_{\varepsilon}(\theta) \propto \Pr\left(\Delta(\eta_0, \eta_{\mathrm{sim}}) \le \varepsilon\right) \propto \sqrt{\varepsilon}\, L(\theta), \tag{5} \]

which means that the likelihood function of the parameter θ is well approximated for ε small enough. However, for small ε the acceptance rate is also small, and a large number of simulated values may be needed to obtain a good approximation of the posterior distribution of θ.

Figure 1 shows the approximations of the posterior distribution of θ obtained with the synthetic likelihood as described in Equation (4), standard ABC following Algorithm 1, and Bayesian optimization for likelihood-free inference (BOLFI) as in Equation (5). The shaded area shows the 95% variability intervals obtained by taking pointwise quantiles at the 0.025 and 0.975 levels over the 250 repeated simulations. In particular, the sample mean is chosen as the summary statistic for the implementation of standard ABC and the synthetic likelihood; moreover, standard ABC is performed by keeping the Euclidean distance as discrepancy measure.
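The standard ABC implementation just described can be sketched as follows. The flat prior on [−10, 10], the true parameter value and the simulation sizes are illustrative assumptions (they are not the settings used for the figures); the summary statistic is simulated directly from its known sampling distribution N(θ, 1/n) rather than from a full dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Observed data: n draws from N(theta_true, 1); the sample mean is a
# sufficient summary statistic, with eta_sim ~ N(theta, 1/n).
n, theta_true = 100, 2.0
y0 = theta_true + rng.standard_normal(n)
eta0 = y0.mean()

def simulate_eta(theta, size):
    """Summary statistic of a simulated dataset, drawn directly from
    its sampling distribution N(theta, 1/n)."""
    return theta + rng.standard_normal(size) / np.sqrt(n)

# Rejection ABC (Algorithm 1) with discrepancy (eta0 - eta_sim)^2.
theta_prior = rng.uniform(-10, 10, 400_000)   # illustrative flat prior
eta_sim = simulate_eta(theta_prior, theta_prior.size)
post = theta_prior[(eta0 - eta_sim) ** 2 <= 1e-4]

def log_synthetic_likelihood(theta, N=1000):
    """Gaussian synthetic log-likelihood in the spirit of Equation (4):
    fit a normal to N simulated summary statistics, evaluate at eta0."""
    etas = simulate_eta(theta, N)
    mu, var = etas.mean(), etas.var(ddof=1)
    return -0.5 * np.log(2 * np.pi * var) - 0.5 * (eta0 - mu) ** 2 / var

# Accepted draws approximate the posterior: mean near eta0, sd near 1/sqrt(n).
print(post.mean(), post.std())
```

Shrinking the threshold 1e-4 tightens the approximation at the cost of a lower acceptance rate, exactly the trade-off described around Equation (5).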
The threshold ε is chosen, here and in the following example, by performing a pilot run in which the threshold is fixed at a relatively large value, and then setting it at the 0.05 quantile of the empirical distribution of the distances between simulated and observed summary statistics in the pilot experiment. On the other hand, BOLFI is applied by using, again, the sample mean as the summary statistic, the Euclidean distance as discrepancy measure, and a logarithmic transformation of the discrepancy to encourage normality. The first algorithm described in Section 3 is performed, such that the discrepancy measure is assumed to follow a Gaussian distribution with constant variance. Moreover, the parameter θ is assumed a priori to be defined on the interval [−, 20], and we chose 20 initialization points sampled directly from the prior before starting to optimize the acquisition of samples; in order to save computational time, we decided to update the Gaussian process every 10 samples. All three algorithms have been applied to produce 1000 samples of accepted parameter values. Computations were performed on a computer with a 2.5 GHz Intel Core i5 processor and 4GB of 1600 MHz DDR3 RAM; the CPU times are: 9.42s for synthetic likelihood (with 1000 summary statistics simulated), 1min 54s for standard ABC (with 20,000 simulated values), and 1min 47s for BOLFI (4 chains of 1000 simulations with the first 75% of simulations used as burn-in).

In this case, the synthetic likelihood produces the best approximation, closely following the true posterior distribution (solid lines), while the standard ABC approach was the least accurate; even the posterior mean here is not close to the truth, indicating the need to further reduce the tolerance ε. BOLFI represents an intermediate situation, slightly over-estimating the spread of the distributions. In this setting, a procedure based on an assumption of normality, as in the synthetic likelihood and BOLFI, increases the quality of the approximation. As there is no loss of information in using the sample mean as summary statistic, the posterior distribution obtained by using the synthetic likelihood approximates the true posterior distribution. The approximation obtained with BOLFI has an additional level of error due to the chosen threshold, but the normality assumption improves the approximation with respect to standard ABC. Results can be improved by increasing the number of simulations.

Figure 1: Approximations of the posterior distributions for θ based on synthetic likelihood (left), standard ABC (centre) and BOLFI (right) for the Gaussian example, with unknown location parameter and known variance. Solid line represents the true posterior distribution, dotted lines are the averages from 250 repetitions of the experiment, and the shaded area shows corresponding 95% pointwise variability intervals.

Ricker model

We now consider the well-known Ricker model, introduced in Ecology by Ricker (1954). Suppose that the number of animals observed from a particular species is y and depends on the entire population, which evolves dynamically. In particular, let

\[ \log N_t = \log r + \log N_{t-1} - N_{t-1} + \sigma e_t \tag{6} \]

where N_t is the unknown population at time t, log r is the log-growth rate, σ is the standard deviation of the innovation, and the e_t are independent standard Gaussian errors. By assumption, log N_0 = 0. Then, the observed population at time t is a Poisson random variable, and the generative model can be described as

\[ y^{(t)}_{\mathrm{sim}} \sim \mathrm{Poi}(\phi N_t) \tag{7} \]

where φ is a scaling parameter. The likelihood function is difficult to evaluate, given the nonlinearity of the state equation and the fact that at every time t it is necessary to integrate out the unobserved population N_t.

Following Gutmann and Corander (2016), we simulate data from model (6) and (7) by fixing σ = 0.3, φ = 10 and log r = 4. Inference is focused on the log-growth rate, while σ and φ are considered nuisance parameters. Figure 2 shows one example of a dataset generated from the Ricker model with these choices of parameters and used to compare the algorithms. In order to apply the algorithms described in Section 2 and Section 3, we perform inference with two sets of summary statistics: one set of five summary statistics, some of which can be considered (almost) Gaussian distributed, and a larger set of 13 summary statistics (proposed by Wood (2010)), whose distributions are less obviously Gaussian. The first set includes: the number of observations greater than 10, the median count, the maximum count, the quantile of level q = 0.75, and the sample mean of the observations greater than 1. The second set includes: the mean, the number of zeros, the autocovariances up to lag 5 (including lag 0), the parameter estimates of the regression model

\[ y_t^{0.3} = \beta_1 y_{t-1}^{0.3} + \beta_2 y_{t-1}^{0.6} + \varepsilon_t \]

with ε_t ∼ N(0, σ_ε²), and the coefficients of a cubic regression of the ordered differences on their observed values.

Figure 2: An example of simulated data from the Ricker model with log r = 4, φ = 10 and σ = 0.3.

A threshold ε = 5.… is used, with priors log r ∼ U(3, …), φ ∼ U(0, 20) and σ ∼ U(0, …); a threshold ε = 4.… is also used.

Figure 3: Approximations of the posterior distributions for log r based on synthetic likelihood (left), standard ABC (centre) and BOLFI (right) for the Ricker model example. Top row obtained from 5 summary statistics, and bottom row from 13 summary statistics. Solid vertical lines represent the true value, dotted lines are the averages from 250 repetitions of the experiment, and the shaded areas show the 95% pointwise variability intervals.

In general, increasing the number of summary statistics and reducing the tolerance level will help improve rejection ABC and BOLFI, while the synthetic likelihood approach is much more dependent on the normality assumptions on the summary statistics. It should be noted that in this example there is a large difference in computational times: one of the appealing characteristics of rejection ABC methods and of methods based on the synthetic likelihood is that they are easily parallelizable. On the other hand, the sequential nature of BOLFI makes it computationally demanding, as happens in this example. The advantage of BOLFI is that it relaxes the assumptions of the synthetic likelihood, while maintaining a parametric model. This can be useful in situations where standard ABC methods are extremely inefficient due to simulation from a vague prior distribution, or when it is difficult to identify regions of the parameter space with high posterior density. In this example, the chosen prior was concentrated in a region of high posterior density, so the standard ABC algorithm was fast; however, for many problems it may not be possible or desirable to use an informative prior. In conclusion, we believe that BOLFI is an intermediate solution between the parametric approach of the synthetic likelihood and the standard ABC approach, as is also evident from the results in Figure 1.

There are many modifications designed to increase the computational efficiency of Algorithm 1.
In general, simulating from areas of the parameter space where the likelihood function is not negligible increases the computational efficiency, as in the population Monte Carlo approach of Marin et al. (2012) or the sequential version of Sisson et al. (2007). See also Wilkinson (2014), Drovandi et al. (2018) and Järvenpää et al. (2019) for efficient ways to simulate parameter values.

Working with the summary statistics can also improve computational efficiency as well as estimation accuracy. For example, one can define a large number of summary statistics and choose the summary statistics in Algorithm 1 as combinations of them (usually determined via regression); these approaches may be found in Nunes and Balding (2010), Fearnhead and Prangle (2012), Aeschbacher et al. (2012) and Blum et al. (2013).

An active area of research not covered in this review focuses on the choice of discrepancy measure based on classification accuracy: the need to compare simulated and observed datasets may be seen as a classification problem, and this creates a natural connection with approaches developed in machine learning; see for example Gutmann et al. (2018). The basic idea is that it should be straightforward to discriminate between datasets which have been generated by very different values of the parameters. In this case, the choice of the summary statistics is replaced by the choice of a classification method. However, while the use of summary statistics may lead to a loss of information, the classification rule is learned from the data, so its choice has a smaller impact on the quality of the approximation. While a thorough analysis and comparison of classification methods has not yet been performed, it is likely that the performance of a classification method will be problem dependent.

Finally, we have not implemented the methods of Section 4, but we feel this direction could take ABC to a higher level: to the realm of big data and high dimensions, with less user-specific tuning.
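The classification-based discrepancy mentioned above can be sketched minimally as follows. A tiny logistic classifier, written in plain numpy, is an illustrative stand-in for the classifiers studied in that literature, and the Gaussian toy data are an assumption for demonstration: training accuracy near 0.5 means the observed and simulated datasets are indistinguishable, so the candidate parameter is plausible.

```python
import numpy as np

rng = np.random.default_rng(1)

def logistic_accuracy(x_obs, x_sim, steps=3000, lr=0.3):
    """Train a one-feature logistic classifier (observed vs simulated)
    by gradient descent on the log-loss and return its training
    accuracy, used here as a discrepancy measure."""
    x = np.concatenate([x_obs, x_sim])
    t = np.concatenate([np.zeros(len(x_obs)), np.ones(len(x_sim))])
    w, b = 0.0, 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(w * x + b)))   # predicted P(simulated)
        w -= lr * np.mean((p - t) * x)           # gradient of mean log-loss
        b -= lr * np.mean(p - t)
    p = 1.0 / (1.0 + np.exp(-(w * x + b)))
    return np.mean((p > 0.5) == (t == 1))

# Observed data from N(0, 1); a candidate theta generates N(theta, 1).
x_obs = rng.standard_normal(500)
acc_near = logistic_accuracy(x_obs, 0.0 + rng.standard_normal(500))
acc_far = logistic_accuracy(x_obs, 3.0 + rng.standard_normal(500))
print(acc_near, acc_far)
```

A distant parameter value is easy to detect (accuracy well above 0.5), while the correct value yields near-chance accuracy; an ABC scheme can then accept parameter values whose classification accuracy falls below a threshold, with no hand-chosen summary statistics.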
Acknowledgements
The authors thank Michael Gutmann and Henri Pesonen for useful conversations. In particular, the authors thank Henri Pesonen for his help with the code used for BOLFI in the examples.

The work of the first author has been partially developed under the PRIN2015-supported project "Environmental processes and human activities: capturing their interactions via statistical methods (EPHAStat)", funded by MIUR (Italian Ministry of Education, University and Scientific Research) (20154X8K23-SH3). The second author is grateful for the support of the Australian Research Council Centre of Excellence for Mathematical and Statistical Frontiers.
References
Aeschbacher, S., Beaumont, M. A., & Futschik, A. (2012). A novel approach for choosing summary statistics in approximate Bayesian computation. Genetics, (3), 1027–1047.
Alsing, J., Charnock, T., Feeney, S., & Wandelt, B. (2019). Fast likelihood-free cosmology with neural density estimators and active learning. arXiv preprint arXiv:1903.00007v1.
An, Z., Nott, D. J., & Drovandi, C. (2018). Robust Bayesian synthetic likelihood via a semi-parametric approach. arXiv preprint arXiv:1809.05800.
Beaumont, M. A. (2010). Approximate Bayesian computation in evolution and ecology. Annual Review of Ecology, Evolution, and Systematics, 379–406.
Beaumont, M. A., Zhang, W., & Balding, D. J. (2002). Approximate Bayesian computation in population genetics. Genetics, (4), 2025–2035.
Blum, M. G. (2010). Approximate Bayesian computation: A nonparametric perspective. Journal of the American Statistical Association, (491), 1178–1187.
Blum, M. G. (2017). Regression approaches for approximate Bayesian computation. arXiv preprint arXiv:1707.01254.
Blum, M. G., & François, O. (2010). Non-linear regression models for Approximate Bayesian Computation. Statistics and Computing, (1), 63–73.
Blum, M. G., Nunes, M. A., Prangle, D., & Sisson, S. A. (2013). A comparative review of dimension reduction methods in approximate Bayesian computation. Statistical Science, (2), 189–208.
Bonassi, F. V., You, L., & West, M. (2011). Bayesian learning from marginal data in bionetwork models. Statistical Applications in Genetics and Molecular Biology, (1).
Bortot, P., Coles, S. G., & Sisson, S. A. (2007). Inference for stereological extremes. Journal of the American Statistical Association, (477), 84–92.
Cameron, E., & Pettitt, A. (2012). Approximate Bayesian Computation for astronomical model analysis: a case study in galaxy demographics and morphological transformation at high redshift. Monthly Notices of the Royal Astronomical Society, (1), 44–65.
Cornuet, J.-M., Santos, F., Beaumont, M. A., Robert, C. P., Marin, J.-M., Balding, D. J., … Estoup, A. (2008). Inferring population history with DIY ABC: a user-friendly approach to approximate Bayesian computation. Bioinformatics, (23), 2713–2719.
Creel, M., & Kristensen, D. (2015). ABC of SV: Limited information likelihood inference in stochastic volatility jump-diffusion models. Journal of Empirical Finance, 85–108.
Drovandi, C. C., Moores, M. T., & Boys, R. J. (2018). Accelerating pseudo-marginal MCMC using Gaussian processes. Computational Statistics & Data Analysis, 1–17.
Evensen, G. (2009). Data assimilation: the ensemble Kalman filter. Springer Science & Business Media.
Everitt, R. G. (2017). Bootstrapped synthetic likelihood. arXiv preprint arXiv:1711.05825.
Fan, Y., Nott, D. J., & Sisson, S. A. (2013). Approximate Bayesian computation via regression density estimation. Stat, (1), 34–48. doi: 10.1002/sta4.15
Fasiolo, M., Pya, N., & Wood, S. N. (2016). A comparison of inferential methods for highly nonlinear state space models in ecology and epidemiology. Statistical Science, (1), 96–118.
Fearnhead, P., & Prangle, D. (2012). Constructing summary statistics for approximate Bayesian computation: semi-automatic approximate Bayesian computation. Journal of the Royal Statistical Society: Series B (Statistical Methodology), (3), 419–474.
Ghurye, S., & Olkin, I. (1969). Unbiased estimation of some multivariate probability densities and related functions. The Annals of Mathematical Statistics, (4), 1261–1271.
Gourieroux, C., Monfort, A., & Renault, E. (1993). Indirect inference. Journal of Applied Econometrics, (S1), S85–S118.
Grazian, C., & Liseo, B. (2017). Approximate Bayesian inference in semiparametric copula models. Bayesian Analysis, (4), 991–1016.
Greenberg, D. S., Nonnenmacher, M., & Macke, J. H. (2019). Automatic posterior transformation for likelihood-free inference. arXiv preprint arXiv:1905.07488v1.
Grendár, M., & Judge, G. (2009). Asymptotic equivalence of empirical likelihood and Bayesian MAP. The Annals of Statistics, (5A), 2445–2457.
Gutmann, M. U., & Corander, J. (2016). Bayesian optimization for likelihood-free inference of simulator-based statistical models. The Journal of Machine Learning Research, (1), 4256–4302.
Gutmann, M. U., Dutta, R., Kaski, S., & Corander, J. (2018). Likelihood-free inference via classification. Statistics and Computing, (2), 411–425.
Itan, Y., Powell, A., Beaumont, M. A., Burger, J., & Thomas, M. G. (2009). The origins of lactase persistence in Europe. PLoS Computational Biology, (8), e1000491.
Izbicki, R., Lee, A. B., & Pospisil, T. (2019). ABC-CDE: Toward Approximate Bayesian Computation with complex high-dimensional data and limited simulations. Journal of Computational and Graphical Statistics, (0), 1–20. doi: 10.1080/10618600.2018.1546594
Järvenpää, M., Gutmann, M. U., Pleska, A., Vehtari, A., & Marttinen, P. (2019). Efficient acquisition rules for model-based approximate Bayesian computation. Bayesian Analysis, (2), 595–622.
Järvenpää, M., Gutmann, M. U., Vehtari, A., & Marttinen, P. (2018). Gaussian process modelling in approximate Bayesian computation to estimate horizontal gene transfer in bacteria. The Annals of Applied Statistics, (4), 2228–2251.
Lei, J., & Bickel, P. (2009). Ensemble filtering for high dimensional non-linear state space models. University of California, Berkeley, Rep, 23.
Leuenberger, C., & Wegmann, D. (2010). Bayesian computation and model selection without likelihoods. Genetics, (1), 243–252.
Li, W., & Fearnhead, P. (2016). Improved convergence of regression adjusted approximate Bayesian computation. arXiv preprint arXiv:1609.07135.
Luciani, F., Sisson, S. A., Jiang, H., Francis, A. R., & Tanaka, M. M. (2009). The epidemiological fitness cost of drug resistance in Mycobacterium tuberculosis. Proceedings of the National Academy of Sciences, (34), 14711–14715.
Lueckmann, J.-M., Bassetto, G., Karaletsos, T., & Macke, J. H. (2018). Likelihood-free inference with emulator networks. arXiv preprint arXiv:1805.09294.
Lueckmann, J.-M., Gonçalves, P. J., Bassetto, G., Öcal, K., Nonnenmacher, M., & Macke, J. H. (2017). Flexible statistical inference for mechanistic models of neural dynamics. In Advances in Neural Information Processing Systems 30 (pp. 1289–1299).
Marin, J.-M., Pudlo, P., Robert, C. P., & Ryder, R. J. (2012). Approximate Bayesian computational methods. Statistics and Computing, (6), 1167–1180.
Marin, J.-M., Raynal, L., Pudlo, P., Ribatet, M., & Robert, C. P. (2019). ABC random forests for Bayesian parameter inference. Bioinformatics, (10).
Marjoram, P., Molitor, J., Plagnol, V., & Tavaré, S. (2003). Markov chain Monte Carlo without likelihoods. Proc. Natl. Acad. Sci. USA, 15324–15328.
Meeds, E., & Welling, M. (2014). GPS-ABC: Gaussian process surrogate approximate Bayesian computation. arXiv preprint arXiv:1401.2838.
Mengersen, K. L., Pudlo, P., & Robert, C. P. (2013). Bayesian computation via empirical likelihood. Proceedings of the National Academy of Sciences, (4), 1321–1326.
Nott, D. J., Drovandi, C. C., Mengersen, K., & Evans, M. (2018). Approximation of Bayesian predictive p-values with regression ABC. Bayesian Analysis, (1), 59–83.
Nott, D. J., Fan, Y., Marshall, L., & Sisson, S. (2014). Approximate Bayesian computation and Bayes linear analysis: toward high-dimensional ABC. Journal of Computational and Graphical Statistics, (1), 65–86.
Nott, D. J., Marshall, L., & Tran, M. N. (2012). The ensemble Kalman filter is an ABC algorithm. Statistics and Computing, (6), 1273–1276.
Numminen, E., Gutmann, M., Shubin, M., Marttinen, P., Meric, G., van Schaik, W., … Sheppard, S. K. (2016). The impact of host metapopulation structure on the population genetics of colonizing bacteria. Journal of Theoretical Biology, 53–62.
Nunes, M. A., & Balding, D. J. (2010). On optimal selection of summary statistics for approximate Bayesian computation. Statistical Applications in Genetics and Molecular Biology, (1).
Owen, A. B. (1988). Empirical likelihood ratio confidence intervals for a single functional. Biometrika, (2), 237–249.
Owen, A. B. (2001). Empirical likelihood. Chapman and Hall/CRC.
Palero, F., Lopes, J., Abelló, P., Macpherson, E., Pascual, M., & Beaumont, M. A. (2009). Rapid radiation in spiny lobsters (Palinurus spp) as revealed by classic and ABC methods using mtDNA and microsatellite data. BMC Evolutionary Biology, (1), 263.
Papamakarios, G., & Murray, I. (2016). Fast ε-free inference of simulation models with Bayesian conditional density estimation. In Advances in Neural Information Processing Systems (pp. 1028–1036).
Papamakarios, G., Pavlakou, T., & Murray, I. (2017). Masked autoregressive flow for density estimation. In Neural Information Processing Systems 30 (pp. 2338–2347).
Papamakarios, G., Sterratt, D. C., & Murray, I. (2019). Sequential neural likelihood: Fast likelihood-free inference with autoregressive flows. arXiv preprint arXiv:1805.07226v2.
Price, L. F., Drovandi, C. C., Lee, A., & Nott, D. J. (2018). Bayesian synthetic likelihood. Journal of Computational and Graphical Statistics, (1), 1–11.
Pritchard, J. K., Seielstad, M. T., Perez-Lezaun, A., & Feldman, M. W. (1999). Population growth of human Y chromosomes: a study of Y chromosome microsatellites. Molecular Biology and Evolution, (12), 1791–1798.
Radev, S. T., Mertens, U. K., Voss, A., & Köthe, U. (2019). Towards end-to-end likelihood-free inference with convolutional neural networks. British Journal of Mathematical and Statistical Psychology. doi: 10.1111/bmsp.12159
Ricker, W. E. (1954). Stock and recruitment. Journal of the Fisheries Board of Canada, (5), 559–623.
Rodrigues, G., Nott, D. J., & Sisson, S. A. (2016). Functional regression approximate Bayesian computation for Gaussian process density estimation. Computational Statistics & Data Analysis, 229–241.
Saulnier, E., Gascuel, O., & Alizon, S. (2017). Inferring epidemiological parameters from phylogenies using regression-ABC: A comparative study. PLoS Computational Biology, (3), e1005416.
Schennach, S. M. (2005). Bayesian exponentially tilted empirical likelihood. Biometrika, (1), 31–46.
Shah, A., Wilson, A., & Ghahramani, Z. (2014). Student-t processes as alternatives to Gaussian processes. In Artificial Intelligence and Statistics (pp. 877–885).
Sisson, S. A., & Fan, Y. (2018). ABC samplers. In S. A. Sisson, Y. Fan, & M. A. Beaumont (Eds.), Handbook of Approximate Bayesian Computation (pp. 87–124). Chapman & Hall/CRC Press.
Sisson, S. A., Fan, Y., & Beaumont, M. A. (2018a). Handbook of Approximate Bayesian Computation. Chapman & Hall/CRC Press.
Sisson, S. A., Fan, Y., & Beaumont, M. A. (2018b). Overview of ABC. In S. A. Sisson, Y. Fan, & M. A. Beaumont (Eds.), Handbook of Approximate Bayesian Computation. Chapman & Hall/CRC Press.
Sisson, S. A., Fan, Y., & Tanaka, M. M. (2007). Sequential Monte Carlo without likelihoods. Proceedings of the National Academy of Sciences, (6), 1760–1765.
Tavaré, S., Balding, D. J., Griffiths, R. C., & Donnelly, P. (1997). Inferring coalescence times from DNA sequence data. Genetics, (2), 505–518.
Vehtari, A., & Ojanen, J. (2012). A survey of Bayesian predictive methods for model assessment, selection and comparison. Statistics Surveys, 142–228.
Wilkinson, R. D. (2013). Approximate Bayesian computation (ABC) gives exact results under the assumption of model error. Statistical Applications in Genetics and Molecular Biology, (2), 129–141.
Wilkinson, R. D. (2014). Accelerating ABC methods using Gaussian processes. arXiv preprint arXiv:1401.1436.
Williams, C. K., & Rasmussen, C. E. (2006). Gaussian processes for machine learning (Vol. 2) (No. 3). MIT Press, Cambridge, MA.
Wood, S. N. (2010). Statistical inference for noisy nonlinear ecological dynamic systems. Nature, 466.