Transformations in Semi-Parametric Bayesian Synthetic Likelihood
Jacob W. Priddle†,‡,∗ and Christopher Drovandi§,‡
‡ School of Mathematical Sciences, Queensland University of Technology (QUT), Australia; Australian Research Council Centre of Excellence for Mathematical and Statistical Frontiers; QUT Centre for Data Science
† ORCID ID: 0000-0003-1154-1139    § ORCID ID: 0000-0001-9222-8763
July 6, 2020
Abstract
Bayesian synthetic likelihood (BSL) is a popular method for performing approximate Bayesian inference when the likelihood function is intractable. In synthetic likelihood methods, the likelihood function is approximated parametrically via model simulations, and then standard likelihood-based techniques are used to perform inference. The Gaussian synthetic likelihood estimator has become ubiquitous in BSL literature, primarily for its simplicity and ease of implementation. However, it is often too restrictive and may lead to poor posterior approximations. Recently, a more flexible semi-parametric Bayesian synthetic likelihood (semiBSL) estimator has been introduced, which is significantly more robust to irregularly distributed summary statistics. In this work, we propose a number of extensions to semiBSL. First, we consider even more flexible estimators of the marginal distributions using transformation kernel density estimation. Second, we propose whitening semiBSL (wsemiBSL) – a method to significantly improve the computational efficiency of semiBSL. wsemiBSL uses an approximate whitening transformation to decorrelate summary statistics at each algorithm iteration. The methods developed herein significantly improve the versatility and efficiency of BSL algorithms.
Keywords: likelihood-free inference, approximate Bayesian computation (ABC), kernel density estimation, copula, covariance matrix estimation, Markov chain Monte Carlo.

∗ Communicating Author: [email protected]

Introduction
Simulator models are a type of stochastic model that is often used to approximate a real-life process. Unfortunately, the likelihood function for simulator models is generally computationally intractable, and so obtaining Bayesian inferences is challenging. Approximate Bayesian computation (ABC) (Sisson et al., 2018a) and Bayesian synthetic likelihood (BSL) (Price et al., 2018; Wood, 2010) are two methods for approximate Bayesian inference in this setting. Both methods eschew evaluation of the likelihood by repeatedly generating pseudo-observations from the simulator, given an input parameter value. ABC and BSL methods have been applied in many different fields; recently, in biology, to model the spread of the Banana Bunchy Top Virus (Varghese et al., 2020); in epidemiology, to model the transmission of HIV (McKinley et al., 2018) and tuberculosis (Lintusaari et al., 2019); and, in ecology, to model the dispersal of little owls (Hauenstein et al., 2019). ABC is a more mature and established technique than BSL, and so it is more prevalent in applied fields. However, ABC can suffer from the curse of dimensionality with respect to the dimension of the summary statistic, requires a large number of model simulations, and the results can be highly dependent on a set of tuning parameters. BSL methods can be used to overcome many of these limitations.

Synthetic likelihood methods approximate the likelihood function with a tractable distribution; in contrast, ABC methods are effectively non-parametric (Blum and François, 2010). The original synthetic likelihood method of Wood (2010) approximates the summary statistic likelihood with a Gaussian distribution and then uses a Markov chain Monte Carlo (MCMC) sampler for maximum likelihood estimation. Later, Price et al. (2018) consider the Gaussian synthetic likelihood in the Bayesian setting, and refer to their method as Bayesian synthetic likelihood. In practice, the Gaussian assumption on the summary statistic vector may be too restrictive, leading to a poor estimate of the likelihood, and then a poor estimate of the posterior. Herein, we refer to the Gaussian BSL method as standard BSL, denoted sBSL.

A few authors have considered more flexible density estimators to improve the robustness of sBSL to irregular summary statistic distributions (e.g. Papamakarios et al., 2018; An et al., 2020; Fasiolo et al., 2018). In particular, the semi-parametric Bayesian synthetic likelihood (semiBSL) method of An et al. (2020) estimates the intractable summary statistic likelihood semi-parametrically – non-parametrically estimating the marginal distributions using kernel density estimation (KDE), and parametrically estimating the dependence structure using a Gaussian copula. An et al. (2020) show empirically that semiBSL performs favourably to sBSL when summary statistics are distributed irregularly. semiBSL maintains many of the attractive properties of sBSL, including its scalability to a high dimensional summary statistic and ease of tuning.

Despite the appeal of semiBSL, the number of model simulations required to accurately estimate the correlation matrix scales poorly with the dimension of the summary statistic. The equivalent problem for sBSL, the scaling of the estimation of the covariance matrix with the number of model simulations, has been explored by An et al. (2019), Ong et al. (2018a), Ong et al. (2018b), Everitt (2017), Frazier et al. (2019) and Priddle et al. (2020). However, there are currently no methods designed specifically for the semi-parametric estimator, which, in practice, may preclude its application to problems where model simulation is computationally expensive. The first contribution of this article adapts and extends the methodology presented in Priddle et al. (2020), which combines a whitening transformation with shrinkage covariance matrix estimation, to the semiBSL context.

SemiBSL provides additional robustness over sBSL when the summary statistic marginals deviate from normality. However, as we demonstrate in subsequent sections, for some distributions the KDE will fail. For instance, when a marginal summary statistic distribution has extremely heavy tails, the KDE will allocate essentially no density to the center of the distribution, and all weight to the tails (see Figure 2). In addition, it is well-known that the global bandwidth KDE rarely provides adequate smoothing over all features of the underlying distribution (Wand et al., 1991; Yang et al., 2003). Our second contribution addresses this problem with a procedure that draws upon and extends the vast body of literature on density estimation. Specifically, we consider transformation kernel density estimation (TKDE, Wand et al., 1991) to estimate the marginal distributions of the summary statistic. The idea is to transform the distribution so that the standard global bandwidth KDE is accurate, and then transform back to the original domain to estimate the density. We adapt the hyperbolic power transformation of Tsai et al. (2017), and propose a procedure to effectively apply TKDEs in a semiBSL algorithm.

The remainder of this article is structured as follows. In sections 2 and 3, we provide an overview of sBSL and semiBSL, respectively. In section 4, we present our method to significantly improve the computational efficiency of semiBSL. In section 5, we propose a new estimator of the marginal summary statistic distributions for semiBSL using TKDE. We assess the accuracy of the TKDEs on a number of test densities with known distribution. In section 6, we apply our new methods to four different examples. Last, we conclude in section 7.
Synthetic likelihood algorithms are applicable in settings where the likelihood function p(y|θ) is intractable but simulation from the model is straightforward, where y = (y_1, ..., y_m)^⊤ (with m ≥ 1) is the set of observed data and θ ∈ Θ ⊂ R^p is an unknown parameter. Here, our target is the posterior distribution p(θ|y) ∝ p(y|θ)p(θ), where p(θ) is the prior distribution on the parameter. In synthetic likelihood, among other likelihood-free algorithms, such as approximate Bayesian computation (ABC) (see Sisson et al., 2018b), it is standard practice to degrade the data to a vector of informative summary statistics to help mitigate problems associated with dimensionality. Specifically, let S(·): R^m → R^d be the summary statistic function that maps an m-dimensional dataset to a d-dimensional summary statistic. For s_y = S(y), the implied target conditional on the summary statistic, often referred to as the partial posterior, is then p(θ|s_y) ∝ p(s_y|θ)p(θ); depending (to a large extent) on the informativeness of the summary statistic, p(θ|y) ≈ p(θ|s_y). However, since p(y|θ) is intractable, it is generally the case that p(s_y|θ) is also intractable, which leads us to consider sampling based methods that do not require evaluation of p(s_y|θ) to obtain approximate inferences from the partial posterior.

In essence, synthetic likelihood methods assume a parametric form of the likelihood, which acts as a surrogate for the true likelihood and may be used directly in an MCMC (Markov chain Monte Carlo) sampler. In sBSL (see Price et al., 2018), the summary statistic likelihood is approximated with a Gaussian distribution, N(s_y; μ(θ), Σ(θ)). The synthetic likelihood parameters μ(θ) and Σ(θ) are typically unknown, but a series of n independent and identically distributed simulations from the model x_1, ..., x_n ∼ p(·|θ) with corresponding summary statistics S(x_1), ..., S(x_n) can be used to construct the Monte Carlo estimates:
\[
\mu_n(\theta) = \frac{1}{n}\sum_{i=1}^{n} S(x_i), \qquad (1)
\]
\[
\Sigma_n(\theta) = \frac{1}{n-1}\sum_{i=1}^{n} \left(S(x_i) - \mu_n(\theta)\right)\left(S(x_i) - \mu_n(\theta)\right)^{\top}. \qquad (2)
\]
These may be used to yield the Gaussian synthetic likelihood estimator, N(s_y; μ_n(θ), Σ_n(θ)), and the corresponding sBSL posterior approximation:
\[
p_{\mathrm{sBSL}}(s_y|\theta) = \int \mathcal{N}\!\left(s_y;\, \mu_n(\theta), \Sigma_n(\theta)\right) \prod_{i=1}^{n} p(S(x_i)|\theta)\, dS(x_1)\cdots dS(x_n),
\]
\[
p_{\mathrm{sBSL}}(\theta|s_y) \propto p_{\mathrm{sBSL}}(s_y|\theta)\, p(\theta).
\]
There are two main appeals of BSL: (1) it can handle a relatively high dimensional summary statistic, and (2) it can be more computationally efficient than competing likelihood-free Bayesian methods (Price et al., 2018; Frazier et al., 2019). These are both direct benefits of specifying a parametric form of the summary statistic likelihood. However, as demonstrated by An et al. (2020), in cases where the marginal summary statistic distributions deviate greatly from Gaussian, with, for example, heavy skewness, heavy tails or multiple modes, sBSL methods begin to break down. Often the posterior distribution will fail to adequately approximate the true partial posterior. In particularly challenging cases, the variance of the log synthetic likelihood estimator may be so large that the MCMC chain will become stuck within only a few iterations, and no discernible posterior distribution may be recovered (see Figure 8).
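As a concrete illustration (this is our own sketch, not code from the paper; `simulate` stands for any user-supplied model simulator that returns a d-dimensional summary statistic), the Gaussian synthetic log-likelihood at a given θ can be estimated from n simulations as follows.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gaussian_synthetic_loglik(s_y, theta, simulate, n, rng):
    """Estimate log N(s_y; mu_n(theta), Sigma_n(theta)) from n model simulations.

    `simulate(theta, rng)` is assumed to return the d-dimensional summary
    statistic of one synthetic dataset generated at `theta`.
    """
    S = np.array([simulate(theta, rng) for _ in range(n)])  # n x d matrix of summaries
    mu_n = S.mean(axis=0)                                    # Monte Carlo estimate of mu(theta)
    Sigma_n = np.cov(S, rowvar=False)                        # sample covariance (divides by n - 1)
    return multivariate_normal.logpdf(s_y, mean=mu_n, cov=Sigma_n)
```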
In this section, we provide an overview of the semiBSL method of An et al. (2020). semiBSL provides additional robustness for a non-Gaussian distributed summary statistic. In semiBSL, the semi-parametric likelihood estimator is constructed as follows. Denote by S^j the random variable corresponding to the j-th summary statistic. Given the set of n model simulations, the true PDF (probability density function) g_{S^j}(s) is approximated using the kernel density estimate:
\[
\hat{g}_{S^j}(s) = \frac{1}{n}\sum_{i=1}^{n} K_h\!\left(s - S(x_i)^j\right), \qquad (3)
\]
where K_h(u) = h^{-1} K(u/h) and h is the bandwidth. The kernel function K(·) may be any symmetric PDF; in semiBSL, the Gaussian kernel K(u) = (1/√(2π)) exp{−u²/2} is used due to its simplicity and unbounded support. The above kernel density estimator uses a global (constant) bandwidth, selected according to the rule of Silverman (1986). It is straightforward to obtain the corresponding estimate of the CDF (cumulative distribution function) Ĝ_{S^j}(s) from the above equation.

Following estimation of the marginal summary statistic distributions, the dependence between the summaries is modelled via the Gaussian copula. Essentially, copula modelling allows the dependence structure and the marginal distributions to be estimated independently, allowing the user to consider alternative and more flexible marginal density estimators than the Gaussian distribution, as is the case in sBSL. For an introduction to copula models, we refer the reader to Trivedi et al. (2007). The Gaussian copula density,
\[
c(u) = \frac{1}{\sqrt{\det(R)}} \exp\left\{ -\frac{1}{2}\, \eta^{\top} \left(R^{-1} - I_d\right) \eta \right\},
\]
is parameterised by the correlation matrix R and the vector of standard Gaussian quantiles η = (Φ^{-1}(u^1), ..., Φ^{-1}(u^d))^⊤, where Φ^{-1}(·) is the inverse CDF of the standard normal distribution and u^j = G_{S^j}(s_y^j) for j = 1, ..., d. Replacing G_{S^j}(s) with its kernel density estimate evaluated at the observed summary, Ĝ_{S^j}(s_y^j), and R with the estimated correlation matrix R̂, we obtain the semiBSL posterior:
\[
p_{\mathrm{semiBSL}}(s_y|\theta) = \int \frac{1}{\sqrt{\det(\hat{R})}} \exp\left\{ -\frac{1}{2}\, \hat{\eta}_{s_y}^{\top} \left(\hat{R}^{-1} - I_d\right) \hat{\eta}_{s_y} \right\} \prod_{j=1}^{d} \hat{g}_j(s_y^j) \prod_{i=1}^{n} p(S(x_i)|\theta)\, dS(x_1)\cdots dS(x_n),
\]
\[
p_{\mathrm{semiBSL}}(\theta|s_y) \propto p_{\mathrm{semiBSL}}(s_y|\theta)\, p(\theta).
\]
In the above equation, η̂_{s_y} = (Φ^{-1}(û^1), ..., Φ^{-1}(û^d))^⊤, where û^j = Ĝ_j(s_y^j) for j = 1, ..., d, and R̂ is estimated using the collection of n simulated summary statistics S(x_1), ..., S(x_n). In practice, An et al. (2020) advocate estimating R with the Gaussian rank correlation (GRC) (see Boudt et al., 2012), which provides additional robustness to the potential lack of fit of the KDEs.

We highlight two main limitations of semiBSL. First, the number of model simulations required to accurately estimate R scales poorly with d. This may be problematic for applications where model simulation is computationally expensive, especially if a relatively low dimensional and informative summary statistic is unavailable. Furthermore, the KDE is unreliable for distributions with extremely heavy tails, which may induce unduly high variance in the p_semiBSL(s_y|θ) estimator and cause semiBSL to fail. In subsequent sections, we propose methods to overcome each of these limitations.
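To make the estimator concrete, the following sketch (our own, not code from An et al. (2020); the helper names are hypothetical, and the Gaussian rank correlation is approximated by the Pearson correlation of normal scores) evaluates the semi-parametric log-likelihood from an n × d matrix of simulated summaries, using Gaussian KDEs with Silverman's bandwidth for the marginals.

```python
import numpy as np
from scipy.stats import norm, rankdata

def kde_pdf_cdf(s, sims):
    """Gaussian KDE of a univariate sample evaluated at s; Silverman's rule bandwidth."""
    n = len(sims)
    sigma = np.std(sims, ddof=1)
    iqr = np.subtract(*np.percentile(sims, [75, 25]))
    h = 0.9 * min(sigma, iqr / 1.34) * n ** (-0.2)
    z = (s - sims) / h
    return norm.pdf(z).mean() / h, norm.cdf(z).mean()

def gaussian_rank_correlation(S):
    """Approximate Gaussian rank correlation of an n x d matrix of summaries."""
    n, d = S.shape
    q = norm.ppf(rankdata(S, axis=0) / (n + 1))   # normal scores of the ranks
    return np.corrcoef(q, rowvar=False)

def semibsl_loglik(s_y, S):
    """Semi-parametric (KDE marginals + Gaussian copula) synthetic log-likelihood."""
    n, d = S.shape
    log_marg, eta = np.zeros(d), np.zeros(d)
    for j in range(d):
        pdf_j, cdf_j = kde_pdf_cdf(s_y[j], S[:, j])
        log_marg[j] = np.log(pdf_j)
        eta[j] = norm.ppf(np.clip(cdf_j, 1e-12, 1 - 1e-12))  # quantile of the KDE CDF
    R = gaussian_rank_correlation(S)
    _, logdet = np.linalg.slogdet(R)
    quad = eta @ (np.linalg.inv(R) - np.eye(d)) @ eta
    return -0.5 * logdet - 0.5 * quad + log_marg.sum()       # log copula + log marginals
```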
We now propose a method to improve the computational efficiency of semiBSL. Namely, we extend the whitening BSL (wBSL) methodology proposed by Priddle et al. (2020) to the semiBSL context. The motivation behind wBSL is articulated in Theorem 1 of Priddle et al. (2020). The main consequence of the theorem is that for a Gaussian log synthetic likelihood estimator with diagonal covariance structure, n must scale linearly with d to control the variance of the estimator. On the other hand, to control the variance of the traditional Gaussian log synthetic likelihood estimator (that estimates the full covariance structure), n must scale quadratically with d. This result suggests that significant computational benefits are possible in BSL algorithms if the summary statistics are uncorrelated.

Despite such a compelling result, it is a challenging problem to find a summary statistic vector that is both independent across its dimensions and retains a large proportion of the information content intrinsic to the observed data. The main idea of wBSL is that an approximate whitening or decorrelation transformation may be applied to the summary statistic at each algorithm iteration. In doing so, the covariance shrinkage estimator of Warton (2008) may be applied with a high penalty, producing an accurate, low variance estimate of the likelihood function for a relatively small number of model simulations. If the full penalty is applied, this coincides with the Gaussian synthetic likelihood estimate with a diagonal covariance structure, and thus the desired computational gains may be achieved. In several empirical examples, Priddle et al. (2020) demonstrate that wBSL is able to produce an accurate partial posterior approximation with an order of magnitude fewer model simulations than sBSL. Given the semi-parametric synthetic likelihood estimator uses the Gaussian copula, it is likely that it will inherit similar computational gains to the classical Gaussian estimator, particularly in cases where the marginal distributions are close to Gaussian. However, the extension of these concepts to semiBSL is not yet clear; here we provide an outline of our methodology, which we refer to as wsemiBSL.

Consider the Gaussian approximation of the summary statistic likelihood:
\[
\mathcal{N}(s_y;\, \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^d \det(\Sigma)}} \exp\left\{ -\frac{1}{2}(s_y - \mu)^{\top} \Sigma^{-1} (s_y - \mu) \right\},
\]
where the dependence of μ and Σ on θ has been suppressed for notational convenience. It is straightforward to show that:
\[
\mathcal{N}(s_y;\, \mu, \Sigma) \propto \frac{1}{\sqrt{\det(R)}} \exp\left\{ -\frac{1}{2}\, \eta_{s_y}^{\top} R^{-1} \eta_{s_y} \right\} \prod_{j=1}^{d} \frac{\mathcal{N}(\eta_y^j;\, 0, \sigma_j^2)}{\phi(\eta_y^j)},
\]
where Σ = Σ_d^{1/2} R Σ_d^{1/2} and Σ_d = diag(σ_1², ..., σ_d²).

The main disparity between wBSL and wsemiBSL is that in wsemiBSL the whitening transformation is applied to the standard Gaussian quantiles, and not directly to the summary statistics. We find that in the context of semiBSL, the latter approach does not produce as accurate posterior approximations (results not shown). Specifically, we propose to apply the whitening transformation to convert the random vector η of arbitrary distribution with covariance matrix Var(η) = R into the transformed vector η̃ = Wη, for some d × d whitening matrix W, such that the covariance Var(η̃) = I_d is the identity matrix. Like in wBSL, we estimate the whitening matrix off-line using n_cov independent model simulations x_1, ..., x_{n_cov} ∼ p(·|θ), given some carefully chosen parameter value θ with reasonable posterior support. Picchini et al. (2020) detail how Bayesian optimization may be used to rapidly generate a θ that has reasonable support under the posterior. This method, or the techniques described in Priddle et al. (2020), may be employed to find a suitable θ for our methods. n_cov is set high (much higher than n) to ensure an accurate estimate of W is obtained. Of course, for the transformation to be exact, W must evolve as a function of θ; however, like in wBSL, we hold W constant to preserve the target partial posterior obtained using semiBSL (when no penalty is applied), and so generally Var(η̃) ≈ I_d. Given the inverse transformation η = W^{-1}η̃ and Jacobian term |dη/dη̃| = det(W^{-1}), the summary statistic likelihood under the transformed variable is
\[
\tilde{g}(s_y|\theta) \propto \frac{\det(W^{-1})}{\sqrt{\det(R)}} \exp\left\{ -\frac{1}{2}\left(W^{-1}\tilde{\eta}_{s_y}\right)^{\top} R^{-1} \left(W^{-1}\tilde{\eta}_{s_y}\right) \right\} \prod_{j=1}^{d} \frac{\mathcal{N}(\eta_{s_y}^j;\, 0, \sigma_j^2)}{\phi(\eta_{s_y}^j)}
= \frac{1}{\sqrt{\det(\tilde{\Sigma}_\eta)}} \exp\left\{ -\frac{1}{2}\, \tilde{\eta}_{s_y}^{\top} \tilde{\Sigma}_\eta^{-1} \tilde{\eta}_{s_y} \right\} \prod_{j=1}^{d} \frac{\mathcal{N}(\eta_{s_y}^j;\, 0, \sigma_j^2)}{\phi(\eta_{s_y}^j)},
\]
where Σ̃_η = WRW^⊤ = Var(η̃_{s_y}) ≈ I_d is the covariance matrix of the transformed quantiles η̃_{s_y}. Of course, in semiBSL, we replace each marginal N(η_{s_y}^j; 0, σ_j²) with the kernel density estimate ĝ_{S^j}(s) and Σ̃_η with a sample estimate. That is,
\[
\tilde{g}(s_y|\theta) \propto \frac{1}{\sqrt{\det(\hat{\tilde{\Sigma}}_\eta)}} \exp\left\{ -\frac{1}{2}\, \hat{\tilde{\eta}}_{s_y}^{\top} \hat{\tilde{\Sigma}}_\eta^{-1} \hat{\tilde{\eta}}_{s_y} \right\} \prod_{j=1}^{d} \frac{\hat{g}_j(s_y^j)}{\phi(\hat{\eta}_{s_y}^j)},
\]
where \hat{\tilde{\eta}}_{s_y} = W(Φ^{-1}(û^1), ..., Φ^{-1}(û^d))^⊤ and û^j = Ĝ_j(s_y^j) for j = 1, ..., d. \hat{\tilde{\Sigma}}_\eta is estimated using the n simulated quantiles \hat{\tilde{\eta}}_{S(x_1)}, ..., \hat{\tilde{\eta}}_{S(x_n)}, which constitute the rows of the n × d matrix W(η̂_{S(x_1)}, ..., η̂_{S(x_n)})^⊤, such that η̂_{S(x_i)} = (Φ^{-1}(û_i^1), ..., Φ^{-1}(û_i^d))^⊤ and û_i^j = Ĝ_j(S(x_i)^j) for j = 1, ..., d and i = 1, ..., n. Given the whitening transformation approximately decorrelates the summary statistic quantiles, the Warton (2008) covariance shrinkage estimator
\[
\tilde{\Sigma}_{\eta,\gamma} = \tilde{\Sigma}_{\eta,d}^{1/2}\left(\gamma \tilde{R}_\eta + (1-\gamma) I_d\right)\tilde{\Sigma}_{\eta,d}^{1/2}
\]
may be applied accurately with a high degree of shrinkage, where Σ̃_{η,d} = diag(Σ̃_η), R̃_η is an estimate of the correlation matrix and γ ∈ [0, 1] is the shrinkage parameter. Effectively, γ is a constant that is multiplied by the off-diagonal elements of the sample covariance. Thus, γ = 0 shrinks the pairwise covariance elements to 0, assuming independent summary statistic quantiles. The heavier the shrinkage, the lower the value of n required to precisely estimate the likelihood.

The choice of whitening matrix W was considered carefully in Priddle et al. (2020). Any W that satisfies Var(η̃) = Var(Wη) = WΣ_ηW^⊤ = I_d will whiten the data at θ; however, as the current parameter value deviates further from θ, the transformation will become less accurate. The most suitable W for BSL is the one that most effectively decorrelates summary statistics generated by parameter values that reside in regions of the parameter space with non-negligible posterior density. Priddle et al. (2020) consider the five optimal whitening matrices of Kessy et al. (2018) and find that, in the context of BSL, principal components analysis (PCA) whitening produces the most accurate partial posterior approximations upon the application of heavy shrinkage. Thus, in wsemiBSL we also use the PCA whitening matrix,
\[
W_{\mathrm{PCA},\eta} = \Lambda_\eta^{-1/2} U_\eta^{\top},
\]
where Λ_η and U_η are the eigenvalue and eigenvector matrices of the covariance matrix Var(η) = Σ_η, such that Σ_η = U_η Λ_η U_η^⊤.
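The following minimal sketch (ours, not the authors' implementation; `estimate_pca_whitening` and `warton_shrinkage` are hypothetical helper names) shows how the PCA whitening matrix might be estimated off-line from n_cov quantile vectors, and how the Warton-style shrinkage covariance of the whitened quantiles can then be formed at each iteration.

```python
import numpy as np

def estimate_pca_whitening(eta_cov):
    """PCA whitening matrix W = Lambda^{-1/2} U^T computed from an n_cov x d matrix
    of estimated standard-normal quantiles of the summaries."""
    Sigma_eta = np.cov(eta_cov, rowvar=False)
    eigval, eigvec = np.linalg.eigh(Sigma_eta)        # Sigma_eta = U Lambda U^T
    return np.diag(eigval ** -0.5) @ eigvec.T

def warton_shrinkage(eta_tilde, gamma):
    """Warton (2008)-style shrinkage of the covariance of the whitened quantiles:
    keep the diagonal, multiply the off-diagonal correlations by gamma."""
    Sigma = np.cov(eta_tilde, rowvar=False)
    D_sqrt = np.diag(np.sqrt(np.diag(Sigma)))
    R = np.corrcoef(eta_tilde, rowvar=False)
    R_shrunk = gamma * R + (1.0 - gamma) * np.eye(Sigma.shape[0])
    return D_sqrt @ R_shrunk @ D_sqrt

# At each MCMC iteration (sketch): whiten the n x d matrix of simulated quantiles,
# eta_tilde = eta_hat @ W.T, then plug warton_shrinkage(eta_tilde, gamma) into the
# Gaussian-copula part of the semi-parametric likelihood; gamma = 0 corresponds to a
# diagonal covariance and the largest reduction in n.
```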
Transformation KDE in semiBSL

Our second contribution significantly improves the robustness of the semi-parametric estimator proposed in An et al. (2020) in the context of BSL. As demonstrated in Figure 2, if a given marginal summary statistic distribution has extremely heavy tails, as is common in financial applications for example (see Section 6.4), the standard KDE does not accurately approximate the true marginal distribution for reasonable sample sizes (the number of model simulations in our context). We propose a new semi-parametric estimator that uses transformation kernel density estimation (see Wand et al., 1991) to model each marginal summary statistic. Like in the classic semiBSL estimator, we model the dependence between the summary statistic dimensions using the Gaussian copula. By doing so, the whitening method proposed in the previous section may be applied in conjunction with the new estimator, to achieve computational gains on top of the improved robustness. In this section, we provide details of our TKDE method for semiBSL.

Transformation kernel density estimation was introduced by Wand et al. (1991), although the general ideas have been applied in many different contexts (see, for example, Kingma et al., 2016; Parno and Marzouk, 2018). In brief, the idea is to transform a sample of data so that the standard global bandwidth kernel density estimator (as in (3)) is more accurate, and then transform back to the original domain to obtain the estimate of the desired density.

Recall we are interested in estimating the marginal distributions of the summary statistic vector. That is, for the j-th marginal S^j, we wish to provide an estimate of the true density g_{S^j}(s) with support supp(g_{S^j}), given access to our sample S(x_1)^j, ..., S(x_n)^j. Hereafter we suppress the j notation for simplicity, and emphasise that we are considering a univariate distribution. Denote a family of bijective and differentiable transformations {G_ω : ω ∈ Ω}, indexed by the parameter ω, that map supp(g_S) to the real line. The PDF of the transformed random variable S̃ = G_ω(S) is given by:
\[
g_{\tilde{S}}(\tilde{s};\, \omega) = g_S\!\left(\mathcal{G}_\omega^{-1}(\tilde{s})\right)\left| \frac{d\,\mathcal{G}_\omega^{-1}(\tilde{s})}{d\tilde{s}} \right|.
\]
The value of ω is chosen so that g_{S̃} is approximately Gaussian. Given this, KDE should provide an accurate approximation of the PDF on the transformed domain according to
\[
\hat{g}_{\tilde{S}}(\tilde{s};\, h, \omega) = \frac{1}{n}\sum_{i=1}^{n} K_h\!\left(\tilde{s} - \mathcal{G}_\omega(S(x_i))\right).
\]
An estimate of the density on the original domain is then obtained via the inverse transformation:
\[
\hat{g}_S(s;\, h, \omega) = \frac{1}{n}\sum_{i=1}^{n} K_h\!\left(\mathcal{G}_\omega(s) - \mathcal{G}_\omega(S(x_i))\right)\left| \frac{d\,\mathcal{G}_\omega(s)}{ds} \right|.
\]
The above estimator can be thought of as using a location adaptive bandwidth on the original domain. This allows more appropriate smoothing over all features of the density, and often leads to a more accurate density approximation. Variable bandwidth methods, such as those proposed in Loftsgaarden et al. (1965) and Breiman et al. (1977), explicitly model the bandwidth as a function of the data. We find (results not shown) that, for the test distributions considered in this paper, TKDE performs better.

A non-trivial aspect of applying TKDE to semiBSL is choosing an appropriate family of transformations, and then finding a method of efficiently estimating ω. The most suitable family of transformations is highly dependent on the shape of the data. Wand et al. (1991) focus on right-skewed data and use the shifted power transformation; Yang et al. (2003) use sequential transformations from the Johnson family to estimate the density of a wide range of distributions, and Buch-Larsen et al. (2005) use the Champernowne transformation for heavy-tailed data. Our method can be extended to use any of these transformations (among others), but, due to its flexibility, we focus on the hyperbolic power transformation (HPT) introduced by Tsai et al. (2017), which has not previously been used in the TKDE context. The HPT is given by:
\[
\mathcal{G}_\omega(s) = \begin{cases} \nu\, \sinh(\psi_- s)\, \mathrm{sech}^{\lambda_-}(\psi_- s)/\psi_-, & s \leq 0, \\ \nu\, \sinh(\psi_+ s)\, \mathrm{sech}^{\lambda_+}(\psi_+ s)/\psi_+, & s > 0, \end{cases}
\]
where s is median centred, ω = {ν, ψ_-, λ_-, ψ_+, λ_+}, ν, ψ_-, ψ_+ > 0 and |λ_-|, |λ_+| ≤ 1. λ_- and λ_+ are the power parameters; ψ_- and ψ_+ are the scale parameters, and ν is the normalising constant. By splitting the data either side of the median, the transformation is able to handle bimodal distributions, provided the modes are not well separated. As demonstrated by Tsai et al. (2017), the HPT outperforms other relevant normality transformations for a wide range of distributions.

There are many different optimality criteria possible to determine ω. Wand et al. (1991) and Yang et al. (2003) use asymptotic results based on minimising the mean integrated square error. Here we follow the approach used in Tsai et al. (2017) and use maximum likelihood estimation. That is, given we wish to transform the summary statistics such that the global bandwidth KDE will perform well, we target the standard normal distribution p(G_ω(s)) = (1/√(2π)) exp{−G_ω(s)²/2} in our transformation. It can be shown that the objective function is given by:
\[
\log p\!\left(S(x_1), \ldots, S(x_n)\,|\,\omega\right) = \sum_{i=1}^{n} \left[ \log \phi\!\left(\mathcal{G}_\omega(S(x_i))\right) + \log\left| J(S(x_i)) \right| \right],
\]
where φ is the PDF of the standard normal distribution, and the Jacobian term is:
\[
|J(s)| = \left| \frac{\partial \mathcal{G}_\omega(s)}{\partial s} \right| = \begin{cases} \nu\left(1 - \lambda_- \tanh^2(\psi_- s)\right)\mathrm{sech}^{\lambda_- - 1}(\psi_- s), & s \leq 0, \\ \nu\left(1 - \lambda_+ \tanh^2(\psi_+ s)\right)\mathrm{sech}^{\lambda_+ - 1}(\psi_+ s), & s > 0. \end{cases}
\]
In practice, there are often several solutions to the score equation; however, only one is the global maximum. Tsai et al. (2017) employ the simplex method (see Nelder and Mead, 1965) to approximate the MLEs by iteratively optimising each split of the data separately and then perturbing the estimate of the slope parameter according to its MLE:
\[
\hat{\nu} = \left( \frac{1}{n}\sum_{i=1}^{n} \tilde{\mathcal{G}}_\omega(S(x_i))^2 \right)^{-1/2},
\]
where G̃_ω denotes the transformation with the normalising constant set to one. In our implementation, we take a similar approach to Tsai et al. (2017) in splitting the data and estimating each of the pairs {ψ_-, λ_-} and {ψ_+, λ_+} separately. However, we update the value of ν using the MLE (using the relevant split of the data) for each evaluation of the likelihood, due to its dependence on the other parameters. We find that this approach works well without having to iteratively maximise the parameters and perturb ν. This is crucial in the context of semiBSL as each iteration of MCMC will involve an estimate of the synthetic likelihood at the proposed parameter value. Of course, our method only serves as an approximation of the true maximum, but as we shall demonstrate, this is sufficient to significantly improve the accuracy of the density estimate over standard KDE. Each marginal summary statistic distribution may be estimated in parallel, meaning the overall additional computational time is small. We use the quantile approach outlined in Tsai et al. (2017) to initialise ω for each optimisation problem. Alternatively, optimal parameters found at previous iterations may be used to inform initial parameter values at subsequent iterations.

Despite the appeal of the HPT, we find that for very heavy tailed data the transformation is not numerically stable. It is also a non-trivial task to reparameterise the transformation such that it is numerically stable. Therefore, we propose an extension of the HPT that uses a series of log transformations to first reduce the heaviness of the tails, allowing the HPT to subsequently be applied more effectively. The log transformations do not require any estimation of parameters, and so they add negligible computation time. For positively skewed data with heavy kurtosis, we use the transformation:
\[
\tilde{S} = \log\!\left(1 + S - \min(S(x_1), \ldots, S(x_n)) + \Delta\right),
\]
where Δ = min(S(x_1), ..., S(x_n)) − s_y + 1 if s_y < min(S(x_1), ..., S(x_n)), and Δ = 0 otherwise. Analogously, we use the following transformation for negatively skewed data with heavy kurtosis:
\[
\tilde{S} = -\log\!\left(1 - S + \max(S(x_1), \ldots, S(x_n)) + \Delta\right),
\]
where Δ = s_y − max(S(x_1), ..., S(x_n)) + 1 if s_y > max(S(x_1), ..., S(x_n)), and Δ = 0 otherwise. Lastly, for symmetric data with heavy kurtosis we use
\[
\tilde{S} = \mathrm{sgn}(S)\, \log\!\left(1 + S\, \mathrm{sgn}(S)\right).
\]
When one of the above three log transformations is applied concurrently with the HPT, we refer to each method as semiBSL TKDE1, semiBSL TKDE2, or semiBSL TKDE3, respectively. SemiBSL TKDE0 refers to semiBSL TKDE without an initial log transformation, and semiBSL TKDE is the general method of using transformation kernel density estimation for semiBSL. We emphasise that the main purpose of the log transformations is to transform the data such that the HPT can be accurately computed, not to transform the data to Gaussian. It is the HPT's job to Gaussianise the log transformed data. Figure 1 shows the estimated density after each step of the TKDE procedure for several test densities with known PDF. Each test density is close to standard Gaussian after applying the HPT (row 3, Figure 1), and the final density estimate is close to the true density (row 4, Figure 1). For our semiBSL TKDE method, we recommend the user perform a number of model simulations at θ, visualise the marginal summary statistic distributions and then decide whether or not (and which) log transformation is necessary.
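As an illustration of the TKDE construction above, here is our own simplified sketch (not the authors' implementation): it fits ψ_-, λ_-, ψ_+ and λ_+ jointly with Nelder-Mead under a simple reparameterisation, profiles out ν, and omits the optional log pre-transformations that the method above uses to stabilise very heavy tails.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def hpt_raw(s, psi_m, lam_m, psi_p, lam_p):
    """Un-normalised HPT and log-Jacobian (excluding log nu); s is median-centred."""
    psi = np.where(s <= 0, psi_m, psi_p)
    lam = np.where(s <= 0, lam_m, lam_p)
    z = psi * s
    g = np.sinh(z) * np.cosh(z) ** (-lam) / psi
    log_jac = np.log(1.0 - lam * np.tanh(z) ** 2) - (lam - 1.0) * np.log(np.cosh(z))
    return g, log_jac

def fit_hpt(sample):
    """Choose (psi-, lam-, psi+, lam+, nu) so the transformed sample is close to N(0, 1)."""
    s = sample - np.median(sample)
    def neg_loglik(w):
        psi_m, psi_p = np.exp(w[0]), np.exp(w[2])      # psi > 0 via exp
        lam_m, lam_p = np.tanh(w[1]), np.tanh(w[3])    # |lambda| < 1 via tanh
        g, log_jac = hpt_raw(s, psi_m, lam_m, psi_p, lam_p)
        nu = 1.0 / np.sqrt(np.mean(g ** 2))            # profile MLE of the normalising constant
        return -(norm.logpdf(nu * g) + np.log(nu) + log_jac).sum()
    w = minimize(neg_loglik, np.zeros(4), method="Nelder-Mead").x
    psi_m, psi_p, lam_m, lam_p = np.exp(w[0]), np.exp(w[2]), np.tanh(w[1]), np.tanh(w[3])
    g, _ = hpt_raw(s, psi_m, lam_m, psi_p, lam_p)
    return psi_m, lam_m, psi_p, lam_p, 1.0 / np.sqrt(np.mean(g ** 2))

def tkde_pdf(s_eval, sample):
    """TKDE: transform, apply a global-bandwidth Gaussian KDE, transform back."""
    med = np.median(sample)
    psi_m, lam_m, psi_p, lam_p, nu = fit_hpt(sample)
    g_s, _ = hpt_raw(sample - med, psi_m, lam_m, psi_p, lam_p)
    g_e, log_jac_e = hpt_raw(np.atleast_1d(s_eval) - med, psi_m, lam_m, psi_p, lam_p)
    t_s, t_e = nu * g_s, nu * g_e
    n = len(sample)
    h = 0.9 * min(np.std(t_s, ddof=1),
                  np.subtract(*np.percentile(t_s, [75, 25])) / 1.34) * n ** -0.2
    dens = norm.pdf((t_e[:, None] - t_s[None, :]) / h).mean(axis=1) / h
    return dens * nu * np.exp(log_jac_e)               # multiply by |dG/ds|
```

In a semiBSL run, a function like `tkde_pdf` would replace the standard KDE for each marginal summary statistic, with ω re-estimated from the n simulated summaries at every MCMC iteration.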
To illustrate the efficacy of the proposed density estimation procedure, we perform a simulation study using a wide range of distributions with known density. This follows directly from the work in An et al. (2020). Specifically, we assume the observed data is drawn from the standard Gaussian distribution, y ∼ φ, and the summary statistic is given by S(y) = sinh(δ(sinh^{-1}(y) + ε)) (this is the sinh-arcsinh transformation of Jones and Pewsey, 2009). ε and δ control the skewness and kurtosis, respectively. Here we choose the values of ε and δ to reflect the shapes of densities that arise in practice, for example, in the models of Section 6. We also consider an observed dataset drawn directly from a bimodal Gaussian distribution, a two-component mixture of N(3, 1) and a second normal component centred at 8, and take S(y) = y. For each test density, we estimate the PDF using KDE and TKDE for n = 100, n = 500 and n = 1000. For TKDE, we show the results using the most appropriate log transformation (or lack thereof); see Figure 2. Furthermore, we estimate the total variation distance between the true and estimated PDFs using numerical integration over a grid, based on 1000 independent replicates of the above procedure. We report the sample mean and standard deviation of the 1000 total variation distances (see Table 1). The total variation between two PDFs f_1(θ) and f_2(θ) is given by tv(f_1, f_2) = (1/2) ∫ |f_1(θ) − f_2(θ)| dθ.

Figure 1: Intermediate densities of the TKDE procedure for various test densities. Each row corresponds to a step in the density estimation: row 1 is a histogram of the original data; row 2 is a KDE after the log transformation; row 3 is a KDE after the HPT (with the standard normal distribution in black) and row 4 is the final density estimate on the original domain (with the true PDF shown in black). Columns correspond to each test density: skewness and kurtosis (ε = 1. , δ = 0. ; left), kurtosis only (ε = 0, δ = 0. ), skewness only (ε = 5, δ = 1), bimodal, heavy kurtosis (ε = 0, δ = 0. ) and skewness with heavy kurtosis (ε = 5, δ = 0. ; right).

Figure 2 demonstrates that the proposed TKDE scheme is able to get much closer to the true PDF than standard KDE, even with a small number of model simulations. The TKDE nicely captures the peaks of each distribution, and provides adequate smoothing over the tails. For ε = 0, δ = 0. , the KDE appears completely flat due to the extremely heavy tails, whereas the TKDE is very accurate. TKDE also outperforms KDE for the bimodal test density, with a noticeably better performance for n = 1000. Interestingly, in some cases, we find that the log transformation is detrimental to the TKDE, and so the user must carefully decide whether or not the log transformation is needed. The simulation results in Table 1 support the above findings, with all TKDEs having a lower total variation distance than the corresponding KDE. The benefits of TKDE are most apparent for heavy tailed distributions.

Figure 2: Comparison of density estimators (KDE and TKDE) with the true density. Rows correspond to n = 100 (top row), n = 500 and n = 1000 (bottom row) model simulations. Columns correspond to the same test densities listed in Figure 1.

Table 1: Total variation distances between density estimators (KDE and TKDE) and the true density. The mean is reported for each test density using n = 100, n = 500 and n = 1000 model simulations. The corresponding standard deviations are given in parentheses.

                                   n = 100              n = 500              n = 1000
                                 KDE      TKDE        KDE      TKDE        KDE      TKDE
Skewness and kurtosis          (0.027)  (0.033)     (0.014)  (0.012)     (0.010)  (0.008)
Kurtosis                       (0.022)  (0.030)     (0.011)  (0.011)     (0.008)  (0.008)
Skewness                       (0.023)  (0.025)     (0.011)  (0.011)     (0.008)  (0.007)
Bimodal                        (0.028)  (0.032)     (0.010)  (0.015)     (0.007)  (0.011)
Heavy kurtosis                 (0.013)  (0.033)     (0.008)  (0.011)     (0.006)  (0.007)
Skewness and heavy kurtosis    (0.011)  (0.009)     (0.005)  (0.004)     (0.004)  (0.003)
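For illustration, a minimal sketch (ours) of one replicate of this comparison: it generates sinh-arcsinh-transformed summaries, evaluates the closed-form true density by a change of variables, and computes the total variation distance by trapezoidal integration on a grid. The ε and δ values below are illustrative placeholders only.

```python
import numpy as np
from scipy.stats import norm

def sinh_arcsinh(y, eps, delta):
    """Sinh-arcsinh transformation of Jones and Pewsey (2009) applied to N(0, 1) draws."""
    return np.sinh(delta * (np.arcsinh(y) + eps))

def true_pdf(s, eps, delta):
    """Closed-form density of S = sinh(delta*(arcsinh(Y) + eps)), Y ~ N(0, 1)."""
    y = np.sinh(np.arcsinh(s) / delta - eps)
    jac = np.cosh(np.arcsinh(s) / delta - eps) / (delta * np.sqrt(1.0 + s ** 2))
    return norm.pdf(y) * jac

def total_variation(f1, f2, grid):
    """0.5 * integral of |f1 - f2| over a fine grid (trapezoidal rule)."""
    return 0.5 * np.trapz(np.abs(f1 - f2), grid)

# One replicate with illustrative values of eps, delta and n:
rng = np.random.default_rng(1)
eps, delta, n = 0.0, 0.5, 500
sims = sinh_arcsinh(rng.standard_normal(n), eps, delta)
grid = np.linspace(sims.min() - 1.0, sims.max() + 1.0, 2000)
# est_on_grid = ...  (a standard KDE, or the tkde_pdf sketch above, evaluated on `grid`)
# tv = total_variation(est_on_grid, true_pdf(grid, eps, delta), grid)
```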
Results

In this section, we apply our methods to four examples. The examples, and what they are designed to demonstrate, are listed below:

1. MA(2) example: demonstrates the potential efficiency gains of wsemiBSL; the robustness of semiBSL TKDE, and the simultaneous use of whitening and TKDE in semiBSL (wsemiBSL TKDE hereafter) for improved efficiency and robustness.
2. Fowler's Toads example: demonstrates the potential efficiency gains of wsemiBSL.
3. M/G/1 example: demonstrates the improved robustness of semiBSL TKDE.
4. α-stable stochastic volatility model: demonstrates the improved robustness of semiBSL TKDE.

The likelihood for the MA(2) example is known, allowing us to compare the results of our methods to the output of a Metropolis-Hastings sampler that uses the true likelihood. Each of the remaining three models has an intractable likelihood and is representative of a real-life modelling scenario.

In all cases, we use the Metropolis-Hastings algorithm with a Gaussian random walk. The random walk covariance matrix is set to be roughly the (approximate) posterior covariance obtained from pilot runs. Unless stated otherwise, the value of n is tuned such that the standard deviation of the log synthetic likelihood evaluated at θ lies in the range that Price et al. (2018) find maximises the computational efficiency of sBSL. We compare posterior approximations using the total variation distance, as described in Section 5. For wsemiBSL, we use n_cov = 5000 to accurately estimate W. Each sampler is run for T = 100000 MCMC iterations.
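The sampler used in all experiments is a standard random-walk Metropolis-Hastings scheme; a minimal sketch (ours) is given below, where `synthetic_loglik` stands for any of the estimators discussed above (Gaussian, semi-parametric, or their whitened variants), re-estimated from n fresh model simulations at each proposed parameter value.

```python
import numpy as np

def bsl_mcmc(s_y, log_prior, synthetic_loglik, theta0, prop_cov, T, rng):
    """Random-walk Metropolis-Hastings with a re-estimated synthetic likelihood."""
    d = len(theta0)
    chain = np.empty((T, d))
    theta, ll = theta0, synthetic_loglik(s_y, theta0, rng)
    L = np.linalg.cholesky(prop_cov)
    for t in range(T):
        prop = theta + L @ rng.standard_normal(d)          # Gaussian random walk proposal
        lp = log_prior(prop)
        if np.isfinite(lp):
            ll_prop = synthetic_loglik(s_y, prop, rng)     # fresh n simulations at the proposal
            if np.log(rng.uniform()) < ll_prop + lp - ll - log_prior(theta):
                theta, ll = prop, ll_prop
        chain[t] = theta
    return chain
```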
The t-th observation x_t in a moving average process of order 2, denoted MA(2), evolves according to:
\[
x_t = w_t + \theta_1 w_{t-1} + \theta_2 w_{t-2},
\]
where w_i ∼ N(0, σ²) for i = 1, ..., T, subject to the constraints −1 < θ_2 < 1, θ_1 + θ_2 > −1 and θ_1 − θ_2 < 1. It is straightforward to show that the likelihood is Gaussian with zero mean vector and pentadiagonal covariance matrix, with entries given by: ζ(0) = 1 + θ_1² + θ_2², ζ(1) = θ_1 + θ_1θ_2 and ζ(2) = θ_2, where ζ(k) = Cov(x_t, x_{t−k}). The MA(2) model is commonly used as a toy example to demonstrate likelihood-free methods (see Chiachio et al., 2014; Marin et al., 2012; Frazier et al., 2019).

We simulate 50 observations from the MA(2) process at θ_true = (θ_1, θ_2)^⊤ = (0. , 0. )^⊤ and set this to be our observed data, such that y = (x_1, ..., x_50)^⊤. We assume that σ² is known, and equal to 1. For semiBSL, we are interested in cases where the marginal summary statistic distributions deviate from Gaussian. As in Section 5, we use the sinh-arcsinh transformation of Jones and Pewsey (2009) to transform and generate a summary statistic with non-Gaussian marginals; thus, s_y = S(y), where S(·) is the sinh-arcsinh transformation applied element-wise. We use a uniform prior over the parameter space.

We first test our methods with a summary statistic generated with 4 different ε and δ combinations. We consider ε = 0, δ = 1, which corresponds to no transformation; a negative value of ε with δ = 1, which creates negative skewness; ε = 0 with δ = 0. , which creates positive kurtosis; and ε = 1, δ = 2, which creates negative kurtosis and positive skewness. For each of these summary statistics, we consider the following methods: semiBSL (equivalent to wsemiBSL with γ = 1); wsemiBSL with γ = 0; semiBSL TKDE (γ = 1) and wsemiBSL TKDE with γ = 0. We compare the results to the 'true' posterior, which is obtained using an MCMC sampler with the true likelihood.

Posterior approximations are shown in Figure 3. Comparing columns 1 with 3 (no shrinkage, γ = 1), and columns 2 with 4 (complete shrinkage, γ = 0), it can be seen that the posterior approximations obtained with TKDE are generally more accurate in terms of the total variation distance to the 'true' posterior. The only case where the posterior approximation obtained using TKDE is less accurate than the corresponding estimate that uses KDE (albeit slightly, with tv distances of 0.2 and 0.16, respectively) is when γ = 0 and the marginal summary statistics are generated with the kurtosis transformation (row 3 of Figure 3).

The bivariate posterior approximations obtained using wsemiBSL with complete shrinkage are good approximations of the 'true' posterior in all cases (Figure 3). Interestingly, we find that there is quite a strong dependence between the regularity of the marginal summary statistic distributions and the capacity of wsemiBSL to significantly reduce the number of model simulations. For no summary statistic transformation, the skewness transformation and the skewness and kurtosis transformation, wsemiBSL is extremely effective – allowing us to reduce n by about an order of magnitude. However, for the kurtosis transformation, we are only able to reduce n by a factor of three, while accurately estimating the log synthetic likelihood. In addition, we find that wsemiBSL is generally not as effective in reducing the number of model simulations when TKDE is used, compared to when KDE is used, to estimate the marginal summary statistic distributions.
This is the case for all summary statistics except for the kurtosis transformation, where n could be reduced further (from n = 330 to n = 275) when TKDE is used compared to standard KDE.

We consider two additional summary statistics with extremely heavy kurtosis. Specifically, we set ε = 0 and δ = 0. , which creates heavy kurtosis, and also ε = 5 and δ = 0. , which creates heavy kurtosis and positive skewness. This presents an extremely challenging example for standard semiBSL. We find that n = 750 is required for semiBSL TKDE for both datasets and n = 20000 is required for semiBSL when ε = 5 and δ = 0. . However, we are unable to find an n that can accurately estimate the log synthetic likelihood when ε = 0 and δ = 0. , since the KDE completely fails even for a huge number of model simulations (due to the heaviness of the tails). We also consider n = 750 for semiBSL, representing the same number of model simulations used for semiBSL TKDE. For these examples, wsemiBSL is ineffective at reducing the required number of model simulations, as the marginal summary statistics deviate too far from Gaussian and the pairwise correlation is low.

Bivariate posterior approximations are shown in Figure 4. For n = 750, it can be seen that standard semiBSL completely fails, while semiBSL TKDE produces an accurate posterior approximation, for both summary statistics. From Figure 5, it can be seen that the acceptance rates are much higher for semiBSL TKDE than standard semiBSL. For ε = 0 and δ = 0. with standard semiBSL, the variance of the log synthetic likelihood is so high that no samples are accepted. When n = 20000 model simulations are used for semiBSL, the parameter space appears to be explored well (Figure 5), but the posterior approximation is far less accurate than that of the semiBSL TKDE method (tv = 0.21 compared to tv = 0.08), which only used n = 750 model simulations.
Figure 3: Bivariate posterior approximations and true marginal summary statistic distributions for the MA(2) example – plot 1. Columns denote (left to right) the true marginal summary statistic distribution, semiBSL KDE, wsemiBSL with γ = 0, semiBSL TKDE and wsemiBSL TKDE with γ = 0. Each row uses the same marginal summary statistics. Black contours correspond to the output of an M-H sampler using the known likelihood, green contours are for γ = 1 and blue contours are for γ = 0. The parameter used to generate the observed data is shown in red. tv denotes the total variation distance between the approximate and true bivariate posterior distributions.

The next example we consider is the individual-based movement model of Fowler's Toads
(Anaxyrus fowleri) developed by Marchand et al. (2017). The model has since been considered as a test example in the likelihood-free literature by several authors (see An et al., 2020; Frazier and Drovandi, 2020; Priddle et al., 2020).

Figure 4: Bivariate posterior approximations and true marginal summary statistic distributions for the MA(2) example – plot 2. Similar to Figure 3, the columns (left to right) denote the true marginal summary statistic distributions, semiBSL with n = 750, semiBSL TKDE with n = 750 and semiBSL with n = 20000.

Figure 5:
Trace plots corresponding to the results in Figure 4 (in the respective order). The traces for θ_1 and θ_2 are shown in green and red.

Marchand et al. (2017) consider three models, each assuming that toads take refuge during the day and forage throughout the night. The models differ in their returning behaviour; here we expressly consider the random return model. We provide only a brief overview of the model herein, and refer the reader to Marchand et al. (2017) for more details.

To simulate from the model, we draw an overnight displacement from the Lévy alpha-stable distribution S(α, ξ), where 1 ≤ α ≤ 2 and ξ > 0. At the end of the night, toads return to their previous refuge site with probability p_0, or take refuge at their current overnight displacement. In the event of a return on day i, the refuge site is chosen randomly from the set of previous refuge sites, thereby giving higher weighting to sites that have been visited multiple times. Here y is the refuge locations of n_t = 66 toads over n_d = 63 days, generated at θ_true = (α, ξ, p_0)^⊤ = (1. , , 0. )^⊤.

The summary statistic is 48-dimensional, and is constructed as follows. For each toad, we split the observed data in two, corresponding to displacements less than or greater than 10m. We count the number of absolute displacements less than 10m. For the latter dataset, we find the distance moved distribution at time lags 1, 2, 4 and 8 days, and compute both the log of the differences in the quantiles and the median. For this example, the marginal summary statistic distributions are roughly Gaussian (see Appendix A, Figure 10), meaning sBSL or wBSL would likely perform well. However, semiBSL (and wsemiBSL) will provide additional robustness over their Gaussian counterparts with little additional computation, and so we would generally advocate using these methods even for such models. Of course, TKDE is not necessary for this example.

We find that n = 500 model simulations is adequate for standard semiBSL. We compare the output of standard semiBSL to wsemiBSL with n = 250 (γ = 0.7), n = 100 (γ = 0.3) and n = 50 (γ = 0) – results are shown in Figure 6. For all cases, the wsemiBSL posterior approximation is close to the standard semiBSL posterior approximation. With complete shrinkage (γ = 0), we are able to reduce the number of model simulations by an order of magnitude.

The M/G/1 queueing model is a stochastic single-server queue model whereby 'customers' arrive according to a Poisson process and service times have a general distribution. Here we expressly consider the case where service times are U(θ_1, θ_2), as this has been a popular choice in other likelihood-free literature (see e.g. An et al., 2020; Blum and François, 2010). The time between arrivals is Exp(θ_3) distributed. We assume that only the inter-departure times are known, and take this to be the observed data y. We observe 50 inter-departure times (corresponding to 51 customers) and set s_y = y, generated at θ_true = (θ_1, θ_2, θ_3)^⊤ = (1, , 0. )^⊤. The prior is U(0, ) × U(0, ) × U(0, 0. ) on (θ_1, θ_2 − θ_1, θ_3).

The marginal summary statistic distributions are right skewed with moderate kurtosis (see Appendix A, Figure 11). Thus, for our TKDE method, it would be reasonable to use semiBSL TKDE0 or semiBSL TKDE1. wsemiBSL does not provide additional benefit for this example, since the summary statistics have very low correlation.
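For reference, a sketch (ours; `simulate_mg1` is a hypothetical name) of simulating the M/G/1 inter-departure times as we read the description above: arrivals follow a Poisson process, each service time is uniform on (θ_1, θ_2), and a customer's service starts at the later of its arrival and the previous departure.

```python
import numpy as np

def simulate_mg1(theta, n_obs, rng):
    """Inter-departure times from an M/G/1 queue with U(theta1, theta2) service times
    and Exp(theta3) inter-arrival times; n_obs gaps require n_obs + 1 customers."""
    theta1, theta2, theta3 = theta
    arrivals = np.cumsum(rng.exponential(1.0 / theta3, n_obs + 1))  # Poisson process arrivals
    services = rng.uniform(theta1, theta2, n_obs + 1)
    departures = np.empty(n_obs + 1)
    prev = 0.0
    for i in range(n_obs + 1):
        prev = max(arrivals[i], prev) + services[i]  # service starts at max(arrival, last departure)
        departures[i] = prev
    return np.diff(departures)  # here the raw inter-departure times are the summary statistic
```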
We run semiBSL TKDE0, semiBSL TKDE1 and standard semiBSL. We compare the results of each sampler to the 'true' posterior, obtained using the MCMC scheme for the M/G/1 queue model of Shestopaloff and Neal (2014). We use n = 500 to estimate the summary statistic likelihood for semiBSL.

Bivariate posterior approximations are shown in Figure 7. Both semiBSL TKDE methods produce more accurate posterior approximations than standard semiBSL. semiBSL TKDE estimates the marginal distribution of one parameter noticeably more accurately than semiBSL; the remaining marginals are similar. The additional log transformation (in semiBSL TKDE1 compared to semiBSL TKDE0) slightly improves the accuracy of the posterior approximation for this example, as evidenced by the total variation distance.
Figure 6: Contour plots of the bivariate posterior approximations for the toad model. Rows (top to bottom) correspond to the n and γ combination, and columns correspond to each pair of parameters. θ_true is shown as a red dot. The total variation distance between each bivariate semiBSL posterior approximation and each bivariate wsemiBSL posterior approximation is shown in the bottom right of each panel.

α-Stable Stochastic Volatility Model

Stochastic volatility models (SVMs) are commonly used in econometric applications, such as the modelling of financial returns (see Vankov et al., 2019). In SVMs, the observed data are assumed to follow a latent stochastic process in evenly spaced discrete time.
Figure 7: Contour plots of the bivariate posterior approximations for the M/G/1 example. Columns (left to right) correspond to each bivariate marginal of (θ_1, θ_2, θ_3). Rows correspond to the method used. 'Exact' denotes the posterior approximation obtained using the method of Shestopaloff and Neal (2014). θ_true is shown as a red dot.

The return process is given by:
\[
y_t = \exp\!\left(\frac{x_t}{2}\right) v_t, \qquad x_t \sim \mathcal{N}\!\left(\mu + \phi(x_{t-1} - \mu),\, \sigma_t^2\right),
\]
where y_t is the observed data at time t, which directly depends on the log volatility x_t and the shock v_t; μ is the modal instantaneous volatility; φ is the persistence parameter, and σ_t² is the variance of x_t. The typical model formulation uses a Gaussian shock parameter, v_t ∼ N(·, ·) (Kim et al., 1998); however, due to the heavy tailedness of asset returns, more recent studies have found the stable distribution to be more appropriate (Casarin, 2004). That is, we assume v_t ∼ SD(α, β, κ, η), where α, β, κ and η control the tail heaviness (with a lower α having heavier tails), the skewness, the scale and the location, respectively. Despite the additional flexibility inherited by this family of SVMs, the PDF of the stable distribution is unavailable in closed form for most parameter values. This motivates the development of likelihood-free algorithms such as ABC and BSL for heavy tailed distributions (see, for example, Ebert et al., 2019; Vankov et al., 2019). The extremely heavy tails of y_t may cause sBSL and standard semiBSL to fail.

We test our methods on two datasets. We infer θ = (α, β)^⊤ and assume the remaining parameters are known and fixed, such that μ = 5, φ = 1, κ = 1, η = 0 and σ = 0. for each dataset. We set θ_true = (1. , 0. )^⊤ and θ_true = (0. , 0. )^⊤ for datasets 1 and 2, respectively, and generate 50 observations from the α-stable SVM and set this to be y in each case. We take s_y = y. Given the marginal summary statistic distributions are symmetric with heavy kurtosis (see Figure 8), we use semiBSL TKDE3. We do not consider wsemiBSL for this model, since there is only a low degree of correlation between the pairwise statistics. The results are compared directly to standard semiBSL. We find n = 2000 is sufficient to control the variance for semiBSL TKDE for each dataset, and n = 20000 is required for semiBSL for dataset 1. We are unable to find a large enough n to accurately estimate the standard semi-parametric synthetic likelihood for dataset 2. Similar to the MA(2) example, we also consider n = 2000 for semiBSL for each dataset – the same value of n we use for semiBSL TKDE.

Marginal posterior approximations are shown in Figure 8. The corresponding trace plots are shown in Figure 9. We observe similar results to the MA(2) example. For n = 2000, for standard semiBSL, the acceptance rate is low (extremely low for dataset 2), while we observe high acceptance rates and good mixing for semiBSL TKDE for both datasets. The posterior approximations for dataset 1 obtained using semiBSL are reasonable, but are poor for dataset 2. On the other hand, the posterior approximations for semiBSL TKDE for each dataset are smooth and have reasonable support for the true parameter value. When n is increased to 20000 for standard semiBSL, the posterior approximation is smoother; however, there is less support for the true parameter value than in the posterior approximation generated using semiBSL TKDE.
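For concreteness, a sketch (ours) of simulating returns from the α-stable SVM described above. SciPy's `levy_stable` is used for the stable shocks (its default parameterisation is assumed to match SD(α, β, κ, η)), and the latent log-volatility is initialised at μ, which is one reasonable choice the text above does not specify.

```python
import numpy as np
from scipy.stats import levy_stable

def simulate_svm(alpha, beta, mu, phi, sigma, kappa, eta, n_obs, rng):
    """Simulate n_obs returns from the alpha-stable stochastic volatility model."""
    y = np.empty(n_obs)
    x_prev = mu                                         # assumed initialisation of the log-volatility
    for t in range(n_obs):
        x_t = rng.normal(mu + phi * (x_prev - mu), sigma)           # AR(1) log-volatility
        v_t = levy_stable.rvs(alpha, beta, loc=eta, scale=kappa, random_state=rng)
        y[t] = np.exp(x_t / 2.0) * v_t                               # heavy-tailed return
        x_prev = x_t
    return y
```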
In this article, we proposed two extensions to semiBSL. First, we extended the wBSL method of Priddle et al. (2020) to the semiBSL context. We demonstrated in a number of empirical examples that our new method, wsemiBSL, is able to produce accurate posterior approximations with up to an order of magnitude fewer model simulations than standard semiBSL, even when the summary statistic deviates from normality. We also proposed a new method to estimate the marginal summary statistic distributions in semiBSL using TKDE. We showed several examples where standard semiBSL will fail due to heavy kurtosis in the marginal summary statistic distributions, whereas our semiBSL TKDE method produces accurate posterior approximations in each case. Furthermore, we showed that wsemiBSL can be used in conjunction with TKDE for both improved computational efficiency and robustness to irregular summary statistic distributions.

There are a few limitations to the proposed methods. For wsemiBSL, we find that there is a rather strong dependence between the regularity of the marginal summary statistic distributions and the potential for large reductions in n. That is, the efficiency gain appears to diminish as the marginal summary statistic distributions become increasingly non-Gaussian.
Figure 8: Posterior approximations and marginal summary statistic distributions for the α-stable SVM. The top row corresponds to dataset 1, and the bottom row corresponds to dataset 2. The columns (left to right) correspond to the marginal summary statistic distributions, and the parameters α and β, respectively.

Figure 9: Trace plots corresponding to the Figure 8 results. Rows correspond to each dataset – the top row for dataset 1 and the bottom row for dataset 2. Columns correspond to the method and number of model simulations combination.

In future work, it may be of interest to consider ways to further increase the robustness of semiBSL to non-linear dependence structures. One way of overcoming such a problem may be via more advanced multivariate transformations such as normalising flows (see Rezende and Mohamed, 2015; Papamakarios et al., 2017). In addition, sBSL is known to be adversely affected in the setting of model misspecification, or more specifically, summary statistic incompatibility (see Frazier and Drovandi, 2020). Future work may investigate the equivalent problem in the context of semiBSL.
References
An, Z., Nott, D. J., and Drovandi, C. (2020). Robust Bayesian synthetic likelihood via a semi-parametric approach. Statistics and Computing, 30(3):543–557.

An, Z., South, L. F., Nott, D. J., and Drovandi, C. C. (2019). Accelerating Bayesian synthetic likelihood with the graphical lasso. Journal of Computational and Graphical Statistics, 28(2):471–475.

Blum, M. G. and François, O. (2010). Non-linear regression models for approximate Bayesian computation. Statistics and Computing, 20(1):63–73.

Boudt, K., Cornelissen, J., and Croux, C. (2012). The Gaussian rank correlation estimator: robustness properties. Statistics and Computing, 22(2):471–483.

Breiman, L., Meisel, W., and Purcell, E. (1977). Variable kernel estimates of multivariate densities. Technometrics, 19(2):135–144.

Buch-Larsen, T., Nielsen, J. P., Guillén, M., and Bolancé, C. (2005). Kernel density estimation for heavy-tailed distributions using the Champernowne transformation. Statistics, 39(6):503–516.

Casarin, R. (2004). Bayesian inference for generalised Markov switching stochastic volatility models. CEREMADE Journal Working Paper, (0414).

Chiachio, M., Beck, J. L., Chiachio, J., and Rus, G. (2014). Approximate Bayesian computation by subset simulation. SIAM Journal on Scientific Computing, 36(3):A1339–A1358.

Ebert, A., Pudlo, P., Mengersen, K., and Wu, P. (2019). Combined parameter and state inference with automatically calibrated ABC. arXiv preprint arXiv:1910.14227.

Everitt, R. G. (2017). Bootstrapped synthetic likelihood. arXiv preprint arXiv:1711.05825.

Fasiolo, M., Wood, S. N., Hartig, F., Bravington, M. V., et al. (2018). An extended empirical saddlepoint approximation for intractable likelihoods. Electronic Journal of Statistics, 12(1):1544–1578.

Frazier, D. T. and Drovandi, C. (2020). Robust approximate Bayesian inference with synthetic likelihood. arXiv preprint arXiv:1904.04551.

Frazier, D. T., Nott, D. J., Drovandi, C., and Kohn, R. (2019). Bayesian inference using synthetic likelihood: asymptotics and adjustments. arXiv preprint arXiv:1902.04827.

Hauenstein, S., Fattebert, J., Grüebler, M. U., Naef-Daenzer, B., Pe'er, G., and Hartig, F. (2019). Calibrating an individual-based movement model to predict functional connectivity for little owls. Ecological Applications, 29(4):e01873.

Jones, M. C. and Pewsey, A. (2009). Sinh-arcsinh distributions. Biometrika, 96(4):761–780.

Kessy, A., Lewin, A., and Strimmer, K. (2018). Optimal whitening and decorrelation. The American Statistician, 72(4):309–314.

Kim, S., Shephard, N., and Chib, S. (1998). Stochastic volatility: likelihood inference and comparison with ARCH models. The Review of Economic Studies, 65(3):361–393.

Kingma, D. P., Salimans, T., Jozefowicz, R., Chen, X., Sutskever, I., and Welling, M. (2016). Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, pages 4743–4751.

Lintusaari, J., Blomstedt, P., Rose, B., Sivula, T., Gutmann, M. U., Kaski, S., and Corander, J. (2019). Resolving outbreak dynamics using approximate Bayesian computation for stochastic birth–death models. Wellcome Open Research, 4(14):14.

Loftsgaarden, D. O., Quesenberry, C. P., et al. (1965). A nonparametric estimate of a multivariate density function. The Annals of Mathematical Statistics, 36(3):1049–1051.

Marchand, P., Boenke, M., and Green, D. M. (2017). A stochastic movement model reproduces patterns of site fidelity and long-distance dispersal in a population of Fowler's toads (Anaxyrus fowleri). Ecological Modelling, 360:63–69.

Marin, J.-M., Pudlo, P., Robert, C. P., and Ryder, R. J. (2012). Approximate Bayesian computational methods. Statistics and Computing, 22(6):1167–1180.

McKinley, T. J., Vernon, I., Andrianakis, I., McCreesh, N., Oakley, J. E., Nsubuga, R. N., Goldstein, M., White, R. G., et al. (2018). Approximate Bayesian computation and simulation-based inference for complex stochastic epidemic models. Statistical Science, 33(1):4–18.

Nelder, J. A. and Mead, R. (1965). A simplex method for function minimization. The Computer Journal, 7(4):308–313.

Ong, V. M., Nott, D. J., Tran, M.-N., Sisson, S. A., and Drovandi, C. C. (2018a). Likelihood-free inference in high dimensions with synthetic likelihood. Computational Statistics & Data Analysis, 128:271–291.

Ong, V. M., Nott, D. J., Tran, M.-N., Sisson, S. A., and Drovandi, C. C. (2018b). Variational Bayes with synthetic likelihood. Statistics and Computing, 28(4):971–988.

Papamakarios, G., Pavlakou, T., and Murray, I. (2017). Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems, pages 2338–2347.

Papamakarios, G., Sterratt, D. C., and Murray, I. (2018). Sequential neural likelihood: Fast likelihood-free inference with autoregressive flows. arXiv preprint arXiv:1805.07226.

Parno, M. D. and Marzouk, Y. M. (2018). Transport map accelerated Markov chain Monte Carlo. SIAM/ASA Journal on Uncertainty Quantification, 6(2):645–682.

Picchini, U., Simola, U., and Corander, J. (2020). Adaptive MCMC for synthetic likelihoods and correlated synthetic likelihoods. arXiv preprint arXiv:2004.04558.

Price, L. F., Drovandi, C. C., Lee, A., and Nott, D. J. (2018). Bayesian synthetic likelihood. Journal of Computational and Graphical Statistics, 27(1):1–11.

Priddle, J. W., Sisson, S. A., Frazier, D. T., and Drovandi, C. (2020). Efficient Bayesian synthetic likelihood with whitening transformations. arXiv preprint arXiv:1909.04857.

Rezende, D. J. and Mohamed, S. (2015). Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770.

Shestopaloff, A. Y. and Neal, R. M. (2014). On Bayesian inference for the M/G/1 queue with efficient MCMC sampling. arXiv preprint arXiv:1401.5548.

Silverman, B. W. (1986). Density estimation for statistics and data analysis, volume 26. CRC Press.

Sisson, S. A., Fan, Y., and Beaumont, M. A. (2018a). Handbook of Approximate Bayesian Computation. Chapman and Hall.

Sisson, S. A., Fan, Y., and Beaumont, M. A. (2018b). Overview of approximate Bayesian computation. In Sisson, S. A., Fan, Y., and Beaumont, M. A., editors, Handbook of Approximate Bayesian Computation, pages 3–54. Chapman and Hall/CRC Press.

Trivedi, P. K., Zimmer, D. M., et al. (2007). Copula modeling: an introduction for practitioners. Foundations and Trends® in Econometrics, 1(1):1–111.

Tsai, A. C., Liou, M., Simak, M., and Cheng, P. E. (2017). On hyperbolic transformations to normality. Computational Statistics & Data Analysis, 115:250–266.

Vankov, E. R., Guindani, M., Ensor, K. B., et al. (2019). Filtering and estimation for a class of stochastic volatility models with intractable likelihoods. Bayesian Analysis, 14(1):29–52.

Varghese, A., Drovandi, C., Mira, A., and Mengersen, K. (2020). Estimating a novel stochastic model for within-field disease dynamics of banana bunchy top virus via approximate Bayesian computation. PLOS Computational Biology, 16(5):e1007878.

Wand, M. P., Marron, J. S., and Ruppert, D. (1991). Transformations in density estimation. Journal of the American Statistical Association, 86(414):343–353.

Warton, D. I. (2008). Penalized normal likelihood and ridge regularization of correlation and covariance matrices. Journal of the American Statistical Association, 103(481):340–349.

Wood, S. N. (2010). Statistical inference for noisy nonlinear ecological dynamic systems. Nature, 466(7310):1102.

Yang, C., Duraiswami, R., Gumerov, N. A., and Davis, L. (2003). Improved fast Gauss transform and efficient kernel density estimation. IEEE.
Summary Statistic Distributions
Figure 10:
Marginal summary statistic distributions for the Fowler’s toads example.
Figure 11: