Inverse Gaussian Process regression for likelihood-free inference
Hongqiao Wang†, Ziqiao Ao‡, Tengchao Yu§, and Jinglai Li¶

Abstract.
In this work we consider Bayesian inference problems with intractable likelihood functions. We present a method to compute an approximation of the posterior with a limited number of model simulations. The method features an inverse Gaussian Process regression (IGPR), i.e., a regression from the output of a simulation model to its input. Within the method, we provide an adaptive algorithm with a tempering procedure to construct the approximations of the marginal posterior distributions. With examples we demonstrate that IGPR has a competitive performance compared to some commonly used algorithms, especially in terms of statistical stability and computational efficiency, while the price to pay is that it can only compute a weighted Gaussian approximation of the marginal posteriors.
Key words.
Approximate Bayesian computation, Bayesian inference, Gaussian process regression, likelihood function
AMS subject classifications.
1. Introduction.
Bayesian inference [11] provides a natural framework for estimating parameters of interest (PoI) from indirect, incomplete and noisy observations. A major advantage of the Bayesian framework is that the posterior distribution is a probabilistic representation of the PoI, and thus can characterize the uncertainty in the inference results. Computing the posterior distribution, i.e., the distribution of the PoI conditional on the observations, is therefore a central task of Bayesian inference. In many practical problems the likelihood function is intractable (namely, evaluating the likelihood function is either computationally prohibitive or simply impossible) [23], rendering most commonly used methods, such as Markov chain Monte Carlo (MCMC) simulation [2] and the variational Bayes method [10, 4], infeasible. Likelihood-free inference methods, which do not require a tractable form of the likelihood function, are needed to address such problems.

Considerable efforts have been devoted to the development of approximate inference methods, and in what follows we briefly overview three classes of such methods, which use different strategies to avoid the likelihood evaluation. First, approximate Bayesian computation (ABC) [3, 23] is probably the most popular class of approximate inference methods. The main idea behind ABC is to generate samples from a prescribed scheme (e.g., from the prior distribution) and then determine whether a sample of the PoI can be accepted as a posterior sample based on the discrepancy between the data simulated from the sample and the observed data. The ABC methods have been successfully applied to

∗ February 23, 2021.
Funding:
This work was partially supported by the NSFC under grant number 11301337.
† School of Mathematics and Statistics, Central South University, Changsha, China.
‡ School of Mathematics, University of Birmingham, Edgbaston, Birmingham B15 2TT, UK.
§ School of Mathematical Sciences, Shanghai Jiao Tong University, 800 Dongchuan Rd, Shanghai 200240, China.
¶ Corresponding author, School of Mathematics, University of Birmingham, Edgbaston, Birmingham B15 2TT, UK. Email: [email protected].

a wide range of practical problems, but they typically require a large number of simulations of the underlying mathematical model, which makes them highly costly for problems with computationally intensive mathematical models. Another often used strategy is to construct a computationally inexpensive surrogate or approximate model of the likelihood or a related function from the simulated data, and then use the resulting surrogate model in a standard posterior computation such as MCMC. A typical example of such methods is the synthetic likelihood [29], which approximates the likelihood function with a normal density estimate for the summary statistics; other examples include [25, 20]. Finally, in this work our interest lies in a type of methods that aim to directly approximate the posterior distribution from the simulated samples. Specifically, they assume a certain conditional density model for the posterior, and then train the conditional density model on the simulated samples. In what follows we shall refer to such approaches as the posterior approximation methods. Works along this line include [16, 19], just to name a few. Note that in these conditional density models the input is the observed data and the output is the distribution of the PoI.
Since in this type of methods the conditional density models are often constructed with complex models such as artificial neural networks, training them is often computationally intensive.

In many practical problems the users may only be interested in the marginal posterior of each parameter, rather than the joint distribution as a whole. The goal of this work is to propose a method to compute the marginal posterior distributions that can be efficiently and stably implemented. The basic idea of the proposed method is to use Gaussian Process (GP) regression [22, 14, 28] as the conditional density model for approximating the actual posterior. Unlike many usual regression methods that provide a point estimate of the response, the GP method computes a Gaussian distribution of the response conditional on the predictors, which provides a natural framework to approximate the posterior. It should be noted that GP methods have been previously used to accelerate the posterior computation in several works [13, 27, 1]. In these works, GP is mainly used as a regression tool, constructed to approximate the likelihood function or the log-likelihood function: namely, the predictor is the PoI and the response is the value of the (log-)likelihood. As a result these methods cannot handle problems where the likelihood function is intractable without extra measures. In the present work we use the GP model to directly approximate the marginal posterior distribution, which is a mapping from the observed data to the distribution of the PoI. As will be discussed later, in the present setup, the data and the PoI are respectively the output and the input of a simulation model, and so our method essentially constructs a GP regression from the output of a model to its input.
For this output-to-input structure we refer to the proposed method as Inverse Gaussian Process Regression (IGPR).

In all the posterior approximation methods including IGPR, one of the most important issues is the construction of the training set [28], i.e., determining the locations at which to perform the simulation of the mathematical model. Since we consider problems with expensive simulation models, determining the sampling points is essential for the computational efficiency. This task, however, is particularly challenging in the present problems, as the inputs of the conditional distribution model (namely, the GP model in our method) are actually the outputs of a simulation model, which makes it impossible to draw a sample at an exact location even if such a location is specified. To address the issue, we design an adaptive algorithm to generate sample points and update the resulting GP approximation of the posterior. Since the training
process of a GP is relatively inexpensive, especially as we use the local GP model that will be explained later, the IGPR method is usually more efficient to implement than those based on artificial neural networks, while the price we pay is that IGPR can only approximate the marginal posteriors. Theoretically we are able to show that the IGPR method can correctly recover the mean of the marginal posterior distribution in the large sample limit. Finally, our numerical experiments demonstrate that IGPR has competitive performance in terms of accuracy, that it is more statistically robust, in the sense that its results are subject to less statistical fluctuation, and that it is more computationally efficient than the methods based on artificial neural networks.

The rest of the paper is organized as follows. In Section 2 we discuss the setup of the approximate inference problems, and provide some preliminaries that will be used in the proposed method. In Section 3 we describe the proposed IGPR method in detail, and in Section 4 we provide some examples to demonstrate its performance. Finally, we offer some closing remarks in Section 5.
2. Problem setup and preliminaries.
In this section, we discuss some important preliminaries of the proposed method, and we start with the basic setup of the inference problems that we want to solve.
Let us consider a Bayesian inference problem: let θ ∈ R^{n_θ} be the parameters of interest, and d ∈ R^{n_d} be the observed data; we want to compute the posterior distribution

(2.1)    π(θ | d) = π(d | θ) π(θ) / π(d),

where π(d | θ) is the likelihood function and π(θ) is the prior distribution on θ. In certain problems, the likelihood function can be directly evaluated, and in this case the posterior samples can be drawn using MCMC simulation. In this work, however, we assume that evaluation of the likelihood function π(d | θ) is infeasible, while it is possible to generate data from the likelihood function for a given value of θ. A typical setup for such problems is that the data is a function of both the parameter θ and certain random variables ξ:

(2.2)    d = G(θ, ξ),

where the distribution of ξ is known. In some situations ξ can be a stochastic process or a random field. It should be clear that when the model G(·, ·) is complex, for example described by a nonlinear differential equation system, it is extremely difficult to derive the associated likelihood function. This type of problem appears quite often in the field of computational biology, ranging from population dynamics to biochemical networks [15, 8]. On the other hand, in such problems it is required that for a given parameter θ̂, synthetic data can be generated by performing a simulation of the underlying mathematical model G(θ̂, ξ). To avoid confusion, in what follows we refer to the synthetic data as samples, denoted as d̂, and to the observed data as data, denoted as d̃. Due to the intractability of the likelihood function, it is not possible to compute the posterior distribution (2.1) exactly. Instead, we seek
to find an approximate posterior distribution for the problem. As is discussed in Section 1, one popular approach to do this is the ABC method. The basic idea of ABC is to approximate the intractable likelihood function by

(2.3)    π_ǫ(d̃ | θ) = ∫ δ_ǫ(ρ(d̃, d̂)) π(d̂ | θ) dd̂,

where δ_ǫ(·) is an indicator function: δ_ǫ(z) = 1 if z ≤ ǫ and δ_ǫ(z) = 0 otherwise; ρ(·, ·) is a distance measure between two data points, and ǫ > 0 is a prescribed tolerance. In this work ρ is taken to be the Euclidean distance unless otherwise stated. It should be clear that π_ǫ(d̃ | θ) = π(d̃ | θ) for ǫ = 0. The posterior distribution π(θ | d̃) can thus be approximated by

(2.4)    π_ǫ(θ | d̃) = (1 / π_ǫ(d̃)) ∫ δ_ǫ(ρ(d̃, d̂)) π(d̂ | θ) π(θ) dd̂,

where π_ǫ(d̃) is the normalization constant. The very basic ABC rejection algorithm generates a set of approximate posterior samples by repeatedly performing the following steps:
1. draw θ̂ from π(θ);
2. generate d̂ ∼ π(· | θ̂);
3. if ρ(d̃, d̂) < ǫ, accept θ̂.
We reiterate that the procedure given above is the very basic version of ABC with rejection (ABC-REJ); more sophisticated ABC algorithms such as ABC with MCMC [17] and ABC with sequential Monte Carlo (ABC-SMC) [24] are also available. We refer to the review articles [3, 23] as well as the references therein for more information on the ABC algorithms.

We now provide a very brief introduction to GP regression, which is the main tool of the proposed method, and we note here that the GP model is slightly modified for use in approximate inference. Simply speaking, GP regression performs a nonparametric regression in a Bayesian framework [28]. Specifically, given m data pairs D = {(x*_i, y*_i)}_{i=1}^m, with x ∈ R^{n_x} and y ∈ R being the explanatory and response variables respectively, the task is to predict the value of y at a new point x; more precisely, we want to compute the conditional distribution π(y | x, D).
A standard model for regression is to assume that the explanatory and response variables are related via an underlying function:

(2.5)    y = f(x) + ζ,

where ζ is an observation noise following N(0, ν). The main idea of the GP method is to assume that the underlying function f(x) is a Gaussian random field defined on R^{n_x}, whose mean is µ(x) and whose covariance is specified by a kernel function k(x, x′), namely, cov[f(x), f(x′)] = k(x, x′). As a result the joint distribution of (Y, f(x)) is

(2.6)    (Y, f(x)) ∼ N( (µ(X), µ(x)), [K(X, X) + νI, K(X, x); K(x, X), K(x, x)] ),

where Y = [y*_1, ..., y*_m], X = [x*_1, ..., x*_m], I is the identity matrix, and the notation K(A, B) denotes the matrix of the covariance evaluated at all pairs of points in set A and in set B. We note here that Eq. (2.6) is the usual form of the GP model with noisy observations [28], where the noise-free prediction f(x) is sought. However, for our purposes we are actually interested in the noisy prediction, to account for the intrinsic uncertainty in the underlying model itself (i.e., even if f(x) is exactly known, y is subject to uncertainty due to the presence of noise); the matter will be discussed further in Section 3. Here we can derive the joint distribution of (Y, y):

(2.7)    (Y, y) ∼ N( (µ(X), µ(x)), [K(X, X) + νI, K(X, x); K(x, X), K(x, x) + ν] ).
It follows immediately that the conditional distribution π(y | x, D) is also Gaussian:

(2.8a)    y | x, D ∼ N(µ_GP(x), σ²(x)),

and the posterior mean µ_GP and variance σ² are available in explicit forms:

(2.8b)    µ_GP(x) = µ(x) + k(x, X)(K(X, X) + νI)^{-1}(Y − µ(x)1_m),
(2.8c)    σ²(x) = k(x, x) − k(x, X)(K(X, X) + νI)^{-1} k(X, x) + ν,

where 1_m is an m-dimensional vector of ones. In what follows we use the notation π_GP(y | x, D) to denote the Gaussian distribution given by the GP model in Eq. (2.8). It is important to note here that the noise variance ν is usually not known in advance; in practice it, along with the other hyper-parameters, is determined by maximum likelihood estimation (MLE). We refer to [28] for more details.
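As an illustration, the GP prediction of Eq. (2.8) can be sketched in a few lines of numpy. The squared-exponential kernel and the zero-mean function below are our own choices for the sketch, not prescribed by the text; any kernel satisfying the later Assumption 3.1 could be substituted.

```python
import numpy as np

def sq_exp(A, B, ell=1.0):
    """Squared-exponential kernel k(x, x') = exp(-|x - x'|^2 / (2 ell^2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell**2)

def gp_predict(x_star, X, Y, kernel, mu, nu):
    """Noisy GP prediction of Eq. (2.8): mean and variance of y | x*, D.
    Note the trailing + nu in the variance, i.e., the noisy prediction."""
    K = kernel(X, X) + nu * np.eye(len(X))        # K(X, X) + nu I
    k_star = kernel(X, x_star[None, :]).ravel()   # K(X, x*)
    alpha = np.linalg.solve(K, Y - mu(X))         # (K + nu I)^{-1} (Y - mu 1_m)
    mean = mu(x_star[None, :])[0] + k_star @ alpha
    var = (kernel(x_star[None, :], x_star[None, :])[0, 0]
           - k_star @ np.linalg.solve(K, k_star) + nu)
    return mean, var
```

For instance, fitting noisy observations of y = sin(x) and predicting at a new point recovers the underlying function up to the noise level.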
3. The IGPR method.
We now discuss how to compute the approximate posterior distribution using the GP model; in particular, we reiterate that our goal is to compute the marginal posterior distribution of each parameter. Specifically, starting with a basic version of the IGPR model in Section 3.1, and making use of a formulation based on a proposal prior as shown in Section 3.2, we develop an adaptive IGPR algorithm in Section 3.3.
In this section we present the basic version of the IGPR method. We want to compute the marginal posterior distributions π(θ^j | d), where θ^j is the j-th component of θ, for any 0 < j ≤ n_θ. Now suppose that we have a set of sample points generated from the likelihood function, {(θ̂_i, d̂_i)}_{i=1}^m. Define the training set as D^j = {..., (d̂_i, θ̂^j_i), ...}, and by assuming that θ^j(d) is a Gaussian random field, we can construct a GP model π_GP(θ^j | d, D^j) from D^j. Plugging the observed data d̃ into the GP model yields the GP-based posterior distribution π_GP(θ^j | d̃, D^j), which can be used as an approximation of the true posterior π(θ^j | d̃). We refer to the GP-based approximate posterior as the IGPR model. In particular, we note that the variance in the IGPR model consists of two parts: one corresponds to the actual variance of the posterior, and the other corresponds to the model uncertainty of the GP due to the limited number of samples. To obtain the IGPR model, we need to construct a training set D^j, and a natural way of doing so is to draw the training sample points from π(θ, d): namely, we first draw samples of the parameter θ from the prior distribution π(θ), obtaining θ̂_1, ..., θ̂_m, and for each sample θ̂_i we then generate a d̂_i from the likelihood function π(· | θ̂_i). Next, following the idea of ABC, we introduce a cut-off distance ǫ, and only include in the training set the samples whose distance to the observed data point d̃ is smaller than ǫ, yielding

D^j_ǫ = {(d̂_i, θ̂^j_i) | ρ(d̂_i, d̃) < ǫ}_{i=1}^m.

This step can also be understood as employing local GP construction [12, 31], and the motivation is two-fold.
First, it eliminates the influence of samples that are far away from the actual data point d̃, which can potentially improve the accuracy of the GP prediction at d̃. Second, for problems where the data set is massively large, there may be some computational issues with the GP model (e.g., the inversion of the large covariance matrix K(X, X) + νI may not be numerically stable), and the use of a local GP alleviates these problems. Using the resulting training set D^j_ǫ and the GP model described in Section 2.3, we are able to compute a Gaussian approximation of each marginal distribution, π_GP(θ^j | d̃, D^j_ǫ), and we define

(3.1)    ψ_GP(θ) = ∏_{j=1}^{n_θ} π_GP(θ^j | d̃, D^j_ǫ),

which can be regarded as an approximation of the joint posterior. Alg. 3.1 provides the pseudocode of this procedure.

Algorithm 3.1
The basic IGPR algorithm
• draw θ̂_1, ..., θ̂_m from π(θ);
• for i = 1, ..., m, draw d̂_i ∼ π(d | θ̂_i);
• for j = 1 to n_θ do
  – let D^j_ǫ = {(d̂_i, θ̂^j_i) | ρ(d̂_i, d̃) < ǫ}_{i=1}^m;
  – construct the GP model π_GP(θ^j | d, D^j_ǫ) from the data set D^j_ǫ;
• let ψ(θ) = ∏_{j=1}^{n_θ} π_GP(θ^j | d̃, D^j_ǫ).

The algorithm presented above is analogous to ABC-REJ in the ABC family, in that both of them directly draw samples from the prior distribution and then select samples based on their distance to the observed data point. Compared to the two more sophisticated versions of IGPR to be discussed later, an advantage of this version of the IGPR model is that it can be implemented in an on-line/off-line fashion: in the off-line stage, a very large number of samples can be drawn from the model to form the data set for constructing the GP model; in the on-line stage, once an observed data point becomes available, it can be directly fed into the GP model to yield the approximate posterior. To this end, though this version of IGPR is relatively simple, it can be useful in applications where the on-line/off-line scheme is suitable.

In Alg. 3.1, the samples of θ are drawn from the prior distribution π(θ). Since in many practical problems the procedure of generating d̂_i from the likelihood function π(· | θ̂_i) can be highly expensive, simply sampling according to the prior distribution may not be a good choice for the computational efficiency, especially when the posterior differs significantly from the prior. Intuitively, a more efficient way to determine the training samples is to draw samples from a distribution that is close to the posterior, which is discussed in this section. Suppose that we are given a distribution that can approximate the posterior; following [19], it is referred to as the proposal prior π_q(θ).
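The procedure of Alg. 3.1 above can be sketched as follows. The per-component GP fit is left as a placeholder `fit_gp`, since any GP implementation (e.g., the equations of Section 2) can be plugged in; all function names here are our own.

```python
import numpy as np

def basic_igpr(sample_prior, simulate, d_obs, m, eps, fit_gp):
    """Sketch of Alg. 3.1: draw (theta, d) pairs, keep those within eps of
    the observed data, and fit one GP marginal per parameter component.
    `fit_gp(D, T)` is a placeholder returning (mean, var) of theta^j at d_obs."""
    thetas = np.array([sample_prior() for _ in range(m)])   # theta_i ~ pi(theta)
    ds = np.array([simulate(t) for t in thetas])            # d_i ~ pi(d | theta_i)
    keep = np.linalg.norm(ds - d_obs, axis=1) < eps         # rho(d_i, d~) < eps
    post = []
    for j in range(thetas.shape[1]):
        post.append(fit_gp(ds[keep], thetas[keep, j]))      # pi_GP(theta^j | d~, D_eps^j)
    return post  # one (mean, var) pair per marginal
```

In the off-line/on-line reading, the first two lines are the off-line simulation stage, and everything after `keep` can be re-run cheaply for each new observed data point.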
In this section we shall discuss how to construct the IGPR model using such a proposal prior. First, it is easy to see that the posterior distribution can be written as

(3.2)    π(θ | d) ∝ π(d | θ) π(θ) = π(d | θ) π_q(θ) · π(θ)/π_q(θ) ∝ π_q(θ | d) π(θ)/π_q(θ),

where

(3.3)    π_q(θ | d) = π(d | θ) π_q(θ) / π_q(d),    π_q(d) = ∫ π(d | θ) π_q(θ) dθ.

Next we can apply Algorithm 3.1 to π_q(θ | d̃) in Eq. (3.3) (except that the samples are now drawn from π_q instead of π), and obtain the Gaussian approximation ψ(θ). It should be clear that the Gaussian distribution ψ(θ) so computed is an approximation of π_q(θ | d) rather than of the actual posterior distribution, as it uses π_q as the prior. Now substituting it for π_q(θ | d) in Eq. (3.2) yields an approximation of the actual posterior π(θ | d̃), denoted by

(3.4)    π_post(θ) ∝ ψ(θ) π(θ) / π_q(θ).

Next we construct an independent Gaussian approximation of the prior,

(3.5)    φ_0(θ) = ∏_{j=1}^{n_θ} N(µ^j_0, (σ^j_0)²),

where µ^j_0 and (σ^j_0)² are respectively the prior mean and variance of θ^j for j = 1, ..., n_θ. It should be noted here that we can use such an independent Gaussian approximation of the prior because we only intend to compute the marginal posterior distributions. If we take π_q(θ) to be an independent Gaussian,

π_q(θ) = ∏_{j=1}^{n_θ} N(µ^j_q, (σ^j_q)²),

we can derive that φ(θ) ∝ ψ(θ) φ_0(θ)/π_q(θ) is also of the form

(3.6a)    φ(θ) = ∏_{j=1}^{n_θ} N(µ^j_φ, (σ^j_φ)²),

with mean

(3.6b)    µ^j_φ = [µ^j_GP (σ^j_GP)^{-2} − µ^j_q (σ^j_q)^{-2} + µ^j_0 (σ^j_0)^{-2}] / [(σ^j_GP)^{-2} − (σ^j_q)^{-2} + (σ^j_0)^{-2}],

and variance

(3.6c)    (σ^j_φ)² = 1 / [(σ^j_GP)^{-2} − (σ^j_q)^{-2} + (σ^j_0)^{-2}].

It is easy to see that, for φ to be a well-defined Gaussian distribution, we must have

(3.7)    (σ^j_GP)^{-2} − (σ^j_q)^{-2} + (σ^j_0)^{-2} > 0,

and measures can be taken to ensure this in the numerical implementation: e.g., one can choose π_q such that σ^j_q ≤ σ^j_0 for all j = 1, ..., n_θ.
Next it follows that the approximate posterior π_post can be written as

π_post(θ) ∝ φ(θ) π(θ) / φ_0(θ).

We note here that in Algorithm 3.2 the third step inside the “for”-loop is employed to make sure that the posterior variance is smaller than the prior variance, which in turn ensures that Eq. (3.7) holds.
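In code, the per-component moment combination of Eqs. (3.6), together with the variance clipping used in Algorithm 3.2, amounts to the following sketch (scalar form, one component θ^j at a time):

```python
import numpy as np

def combine_marginal(mu_gp, s_gp, mu_q, s_q, mu_0, s_0):
    """Eq. (3.6): moments of phi^j proportional to psi^j * phi_0^j / pi_q^j.
    All arguments are scalars for one component theta^j; s_* are std devs."""
    # Clip the GP std at the prior std, as in the third step of the
    # Algorithm 3.2 loop; Eq. (3.7) must still hold for the chosen pi_q.
    s_gp = min(s_gp, s_0)
    prec = 1.0 / s_gp**2 - 1.0 / s_q**2 + 1.0 / s_0**2            # Eq. (3.6c) inverse
    var = 1.0 / prec
    mean = (mu_gp / s_gp**2 - mu_q / s_q**2 + mu_0 / s_0**2) * var  # Eq. (3.6b)
    return mean, var
```

As a sanity check on the formula, when the proposal prior coincides with the Gaussian prior approximation (µ_q = µ_0, σ_q = σ_0), φ reduces to the GP marginal ψ, as expected from φ ∝ ψ φ_0/π_q.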
Algorithm 3.2
The proposal prior based IGPR algorithm
• draw θ̂_1, ..., θ̂_m from π_q(θ);
• for i = 1, ..., m, draw d̂_i ∼ π(d | θ̂_i);
• for j = 1 to n_θ do
  – let D^j = {(d̂_i, θ̂^j_i) | ρ(d̂_i, d̃) < ǫ}_{i=1}^m;
  – construct the GP model π_GP(θ^j | d, D^j) from the data set D^j;
  – if σ^j_GP > σ^j_0, let σ^j_GP = σ^j_0;
• let ψ(θ) = ∏_{j=1}^{n_θ} π_GP(θ^j | d̃, D^j);
• compute φ(θ) using Eqs. (3.6);
• let π_post(θ) ∝ φ(θ) π(θ) / φ_0(θ).

In the proposal prior based method, the key is to obtain a good proposal π_q. As has been discussed, the samples used to construct the GP approximation are drawn from the proposal prior π_q, and intuitively, for these samples to be informative, we need them to cover the high probability region of the posterior. As such, a natural choice is to select π_q to be close to the posterior distribution. Certainly this cannot be done in one step, as the posterior is not known in advance, and so in the next section we introduce a scheme to adaptively identify the proposal prior and construct the IGPR model for a given posterior distribution.

Here we discuss how to construct the proposal prior π_q(θ). As discussed in the previous section, computing a good proposal prior π_q is rather challenging, especially when the posterior is concentrated in a rather small region of the entire prior state space. One remedy is to use a simulated tempering approach [6]. Namely, we construct a sequence of “intermediate posterior distributions” with the first one being the prior and the last one being the actual posterior: {π_t(θ | d̃)}_{t=0}^T with π_0(θ | d̃) = π(θ) and π_T(θ | d̃) = π(θ | d̃). Typically, the sequence of “intermediate posteriors” should be constructed in a way that their variances are decreasing.
One can see that the standard tempering/annealing approaches do not apply here, as we do not have an explicit expression of the posterior distribution. Therefore, we use a special tempering strategy that creates artificial data with larger observation noise. Namely, we introduce a sequence of artificial data noises η_t ∼ N(0, σ²_t I). Here the standard deviation σ_t plays the role of the tempering parameter, and it should be chosen to decrease gradually with respect to t, with σ_T = 0. We then define, for any 1 ≤ t ≤ T,

d̃′_t = d̃ + η_t,    and    π_t(θ | d̃) = π(θ | d̃′_t),

and one can see that π_T(θ | d̃) = π(θ | d̃′_T) = π(θ | d̃). Other tempering approaches can also be used to design the posterior sequence, as long as they satisfy the aforementioned conditions.

Once a tempering scheme is chosen, the adaptive scheme proceeds as follows. For any 1 ≤ t ≤ T, let φ_{t−1}(θ) be the independent Gaussian approximation of the posterior of the previous step, π_{t−1}(θ | d̃); we then use φ_{t−1}(θ) as the proposal prior for the present step to compute φ_t(θ), the independent Gaussian approximation of π_t(θ | d̃). As has been explained above, π_t(θ | d̃) is essentially π(θ | d̃′_t), and thus the key step here is to construct the IGPR model for π(θ | d̃′_t). Specifically, we first generate the simulated data points {(d̂_i, θ̂_i)}_{i=1}^m from φ_{t−1}(θ) π(d | θ); for each i = 1, ..., m, we let

d̂′_i = d̂_i + η_t,    with η_t ∼ N(0, σ²_t I);

the last step is to construct the IGPR model using the data set {(d̂′_i, θ̂_i)}_{i=1}^m as described earlier. The procedure is repeated until it reaches the last step t = T. It is important to note here that, in Algorithms 3.1 and 3.2, it is assumed that the cut-off distance ǫ is prescribed; here, instead, we allow ǫ_t in each step to be determined by choosing a certain portion ω of the samples in terms of their distances to the data point d̃.
By determining the cut-off distance this way, we can ensure that sufficient samples are used to construct the GP model. We conclude this section by presenting the complete procedure of the adaptive IGPR in Alg. 3.3.

In this section, we provide some analysis of the IGPR model with a proposal prior, and in particular we study its asymptotic properties with respect to both the cut-off threshold ǫ and the sample size m. Given a likelihood function π(d | θ) and a proposal prior π_q(θ), we can define a joint distribution

π_ǫ(θ, d) ∝ π(d | θ) π_q(θ) δ_ǫ(ρ(d, d̃)),

and, as mentioned earlier, ρ(·, ·) is taken to be the Euclidean distance. Note here that we drop the superscript j in θ^j without causing any ambiguity. Suppose that a training set D_ǫ(d̃) = {(θ_i, d_i)}_{i=1}^m is generated according to the distribution π_ǫ(θ, d); it should be clear that for any two points d_i and d_{i′} in D_ǫ, we have ρ(d_i, d̃) ≤ ǫ and ρ(d_i, d_{i′}) ≤ 2ǫ. Next we study the properties of the GP model constructed with D_ǫ. First we introduce an assumption on the kernel function k(d, d′) of the GP model:

Assumption 3.1. (a): for any d ∈ R^{n_d}, k(d, d) = c, where c is an arbitrary positive constant; (b): for any d ∈ R^{n_d} and any υ > 0, there exists ǫ > 0 such that, for all d′ ∈ R^{n_d} satisfying ρ(d, d′) < ǫ, we have |k(d, d′) − k(d, d)| < υ.

Algorithm 3.3
The adaptive IGPR algorithm
Algorithm parameters: m, T, ω, {σ_t}_{t=1}^T;  Output: π_post(θ | d̃);
let φ_0(θ) = ∏_{j=1}^{n_θ} N(µ^j_0, (σ^j_0)²), where µ^j_0 and (σ^j_0)² are respectively the prior mean and variance of θ^j;
for t = 1 to T do
  draw m samples {θ̂_1, ..., θ̂_m} from the proposal prior φ_{t−1}(θ);
  for i = 1 to m do
    draw d̂_i ∼ π(· | θ̂_i);
    let d̂′_i = d̂_i + η_t with η_t ∼ N(0, σ²_t I);
    calculate δ_i = ‖d̂′_i − d̃‖;
  end for
  let ǫ_t be the ω-th quantile of {δ_i}_{i=1}^m;
  for j = 1 to n_θ do
    let D^j = {(d̂′_i, θ̂^j_i) | δ_i < ǫ_t}_{i=1}^m;
    construct the GP model π_GP(θ^j | d, D^j) from the data set D^j;
    if σ^j_GP > σ^j_0 then let σ^j_GP = σ^j_0; end if
  end for
  let ψ_t(θ) = ∏_{j=1}^{n_θ} π_GP(θ^j | d̃, D^j);
  compute φ_t(θ) ∝ ψ_t(θ) φ_0(θ)/φ_{t−1}(θ) using Eqs. (3.6);
end for
let π_post(θ) ∝ φ_T(θ) π(θ)/φ_0(θ).

Note that most popular kernel functions, such as the squared exponential, exponential, and rational quadratic kernels, all satisfy this assumption. We then have the following theorem:
Theorem 3.2.
For any d̃ ∈ R^{n_d}, let π_GP(θ | d̃, D_ǫ(d̃)) be the GP model constructed with D_ǫ(d̃) and evaluated at d̃, and let µ^m_ǫ(d̃) and σ^m_ǫ(d̃) be respectively the mean and standard deviation of π_GP(θ | d̃, D_ǫ(d̃)). Suppose that the kernel function k(·, ·) of the GP model satisfies Assumption 3.1. Then we have

(3.8)    µ^m_ǫ(d̃) → µ^m(d̃) in distribution as ǫ → 0,    and    µ^m(d̃) → E[θ | d̃] almost surely as m → ∞.

Proof.
See Appendix A.

Loosely speaking, Theorem 3.2 states that the mean obtained by the IGPR model converges to the actual posterior mean as ǫ approaches zero and the number of samples increases.
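Putting the pieces together, the tempering loop of Alg. 3.3 can be sketched for a single scalar parameter as follows. The placeholder `fit_gp` again stands in for the per-component GP fit, and all function and variable names here are our own.

```python
import numpy as np

def adaptive_igpr(mu0, s0, simulate, d_obs, sigmas, m, omega, fit_gp, rng):
    """Sketch of Alg. 3.3 for one scalar parameter theta.
    `sigmas` is the decreasing tempering schedule (last entry 0);
    `fit_gp(D, T, d_obs)` is a placeholder returning (mean, std) of the
    GP marginal evaluated at the observed data."""
    mu_q, s_q = mu0, s0                        # phi_0: Gaussian approx. of the prior
    for sigma_t in sigmas:
        thetas = rng.normal(mu_q, s_q, size=m)              # draw from phi_{t-1}
        ds = np.array([simulate(t) for t in thetas])
        ds = ds + sigma_t * rng.standard_normal(ds.shape)   # artificial noise eta_t
        delta = np.linalg.norm(ds - d_obs, axis=1)
        eps_t = np.quantile(delta, omega)                   # keep a fixed fraction
        keep = delta <= eps_t
        mu_gp, s_gp = fit_gp(ds[keep], thetas[keep], d_obs)
        s_gp = min(s_gp, s0)                                # enforce sigma_GP <= sigma_0
        # phi_t propto psi_t * phi_0 / phi_{t-1}, via Eqs. (3.6)
        prec = 1.0 / s_gp**2 - 1.0 / s_q**2 + 1.0 / s0**2
        var = 1.0 / prec
        mu_q = (mu_gp / s_gp**2 - mu_q / s_q**2 + mu0 / s0**2) * var
        s_q = float(np.sqrt(var))
    return mu_q, s_q
```

Note that because the selected subsample is always tighter than the current proposal, the precision `prec` stays positive in this sketch, mirroring the role of Eq. (3.7).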
4. Numerical examples.

4.1. Overview of the examples.
In this section we provide four numerical examples to demonstrate the performance of the proposed IGPR method. In the first example, we test the IGPR method on a one-dimensional toy problem, the main purpose of which is to demonstrate
Figure 1.
For the 1-D problem. Left: the GP model constructed by Alg. 3.1, where the solid line is the posterior mean predicted by the GP, the shaded area is the 68% confidence interval (CI), and the asterisks are the samples generated from the likelihood function; right: the top figure is the predicted posterior mean and the bottom figure is the predicted posterior standard deviation (STD), both plotted against the number of iterations.

some properties of the method, as well as to compare the basic and the adaptive versions of it. The other three examples concern real-world applications, and they are all described by dynamical systems coupled with model errors, which renders the likelihood function intractable. In these three examples, we compare the performance of IGPR with a representative approximate inference method, the Sequential Neural Likelihood (SNL) [20]. SNL is a sampling method that trains an autoregressive flow on simulated data to approximate the likelihood function, and we choose to compare with SNL because a thorough comparison of SNL with many other popular methods has been conducted in [20].

An important issue here is that in all three real-world examples the raw data recorded are actually time-series signals, which means the dimensionality of the data is extremely high and thus beyond the capacity of any method for approximate inference. A typical remedy for this is to construct some summary statistics of the high-dimensional time-series data, and conduct inference using these statistics. In these examples we construct the summary statistics following the method in [27]. Specifically, the absolute value of the signal amplitude, the velocity, the acceleration, and the power spectral density (PSD) are extracted from the signal. The mean, variance, skewness, and kurtosis of each physical quantity are computed as the features, resulting in 16-dimensional data d used to infer the model parameters θ.
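One plausible realization of this 16-dimensional summary construction is sketched below; the finite-difference derivatives and the simple periodogram PSD are our assumptions about details the text does not fully specify.

```python
import numpy as np

def summary_stats(signal, dt=1.0):
    """Sketch of a 16-dimensional summary: mean, variance, skewness and
    kurtosis of |amplitude|, velocity, acceleration and PSD of a signal."""
    def moments(z):
        mu, sd = z.mean(), z.std()
        zc = z - mu
        return [mu, z.var(), (zc**3).mean() / sd**3, (zc**4).mean() / sd**4]
    amp = np.abs(signal)                   # absolute signal amplitude
    vel = np.gradient(signal, dt)          # finite-difference velocity
    acc = np.gradient(vel, dt)             # finite-difference acceleration
    psd = np.abs(np.fft.rfft(signal))**2   # periodogram power spectral density
    return np.array(moments(amp) + moments(vel) + moments(acc) + moments(psd))
```

The same 16-dimensional vector is then computed for the observed signal and for every simulated signal, so the distance ρ in the algorithms is evaluated between summaries rather than raw time series.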
Moreover, for the purpose of performance evaluation, in all the examples the ground truth of these parameters is pre-determined and the data are simulated from the underlying model accordingly. Finally, since the wall-clock computational cost is reported in some examples, it is useful to mention that all the experiments are conducted on a desktop computer with a 3.6 GHz CPU and 16 GB memory.
Figure 2.
For the 1-D problem. Left: the GP model constructed by Alg. 3.3, where the solid line is the posterior mean predicted by the GP, the shaded area is the 68% confidence interval (CI), and the asterisks are the samples generated from the likelihood function; right: the top figure is the predicted posterior mean and the bottom figure is the predicted posterior standard deviation (STD), both plotted against the number of iterations.
Our first example is a one-dimensional (1-D) toy problem where the simulation model is

(4.1)  d = erf(θ + η),  η ∼ N(0, σ_η²),

with erf(·) being the error function. The prior distribution of θ is uniform, the ground truth is θ = 1, and the associated observation is d = 0.869.

In Fig. 1 we show the GP model constructed by the basic (non-adaptive) algorithm, where the observed data d = 0.869 is indicated by the dashed vertical line. Also shown in the figure are the samples drawn from the likelihood function, which are distributed largely uniformly in the state space, as they are drawn directly from the prior. As a result, the variance of the GP model is reduced rather evenly across the state space. On the right of Fig. 1 we plot the predicted posterior mean and variance as functions of the number of iterations, and we observe clear convergence of both as the number of iterations increases.

Next we test the adaptive IGPR method. With this method, we construct the IGPR model with 5 initial points and 40 iterations with one sample in each iteration, resulting in 45 samples from the likelihood function in total. Note that in this toy problem we use all the available samples to construct the GP, which is slightly different from Alg. 3.3. In Fig. 2 (left) we plot the posterior mean and variance (corresponding to the 68% confidence interval) predicted by the IGPR method, as well as the data points. First, unlike the samples in Fig. 1, which are distributed uniformly, most of these samples are placed near the true data point, which demonstrates the effectiveness of the adaptive scheme for selecting sample points. As a result, the variance predicted by the IGPR model near the data point is much smaller than in areas far from it, which also differs from Fig. 1, where the variance is distributed more evenly across the whole interval. The plots on the right show the mean and the standard deviation with respect to the sample size, exhibiting much faster convergence than in Fig. 1. Finally, we report that the posterior mean predicted by the adaptive IGPR (Alg. 3.3) is 1.12.

In terms of real-world applications, we first consider a simple metabolic pathway network, given by the stochastic dynamical system [26]

(4.2)  Ẋ₁ = (α₁ X₂^{−·} − β₁ X₁^{·}) exp(ξ_t),
       Ẋ₂ = β₁ X₁^{·} − β₂ X₂^{·},

where ξ_t is a white Gaussian noise and the initial conditions are X₁(0) = 1 and X₂(0) = 1. In this model, X₁ and X₂ are two dependent metabolites, and (α₁, β₁, β₂) are three key parameters describing how X₁ and X₂ are coupled. These three parameters cannot be measured directly and need to be inferred from the system output. In particular, we make measurements of the signal X(t) = X₁(t) + X₂(t), and, as mentioned earlier, 16 summary statistics are extracted to infer the model parameters. Since this is a synthetic example, the "true values" of the parameters are known and shown in Table 1. Moreover, we assume a prior distribution on each of the parameters as given in Table 1. Due to the presence of the model noise ξ_t, an analytical expression of the likelihood function in this problem is not available, which makes the usual posterior estimation methods infeasible.

We implement IGPR and SNL to compute the posterior distributions of the parameters, and for the purpose of comparison we use the same number of iterations and the same number of samples per iteration in both methods. Specifically, we fix the number of iterations to be T = 10, and use three different sample sizes per iteration: m = 50, m = 100 and m = 200, resulting in 500, 1000 and 2000 samples in total, respectively. In IGPR, we use all the samples for m = 50 and m = 100, since the sample size in each iteration is rather small, and ρ = 25% for m = 200. The tempering scheme is chosen to be σ_t = c (T − t)/T for a constant c, which is also used in the next two examples.
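Before presenting the results, the core computation in each IGPR step, a GP regression from simulated outputs d back to the parameter θ evaluated at the observed data, can be sketched as follows. This is a minimal numpy sketch on a generic scalar model: the simulator, prior range, kernel width, and nugget below are illustrative assumptions, not the exact settings of the algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

# A hypothetical scalar simulator standing in for the forward model
def simulator(theta):
    return np.tanh(theta) + 0.05 * rng.normal(size=np.shape(theta))

theta_truth = 0.8
d_obs = np.tanh(theta_truth)              # noise-free observation for the demo

theta = rng.uniform(-3.0, 3.0, size=200)  # parameter samples from a uniform prior
d = simulator(theta)                      # corresponding simulator outputs

def rbf(a, b, ell=0.2):
    """Squared-exponential kernel on scalar inputs."""
    return np.exp(-0.5 * (a[:, None] - b[None, :])**2 / ell**2)

# GP regression in the *inverse* direction, d -> theta, with nugget sigma2
sigma2 = 0.01
K = rbf(d, d) + sigma2 * np.eye(len(d))
k_star = rbf(np.array([d_obs]), d)[0]
post_mean = k_star @ np.linalg.solve(K, theta)           # approx. posterior mean of theta | d_obs
post_var = 1.0 - k_star @ np.linalg.solve(K, k_star)     # approx. posterior variance
print(post_mean, post_var)
```

Because the GP is evaluated at the observed data, its predictive mean and variance directly give a Gaussian approximation of the (marginal) posterior, which is the quantity reported in the experiments below.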
Since the posterior mean is a commonly used estimator for the parameters, we evaluate the performance of the methods by calculating the estimation error between the posterior mean and the ground truth for each parameter. We repeat the simulation 50 times for each method and compute the average estimation errors; the results are summarized in Table 2, together with the standard deviations (STD) of the estimation errors. One can see from the table that, with 500 samples in total, the results of IGPR are much less accurate than those of SNL; with 1000 samples IGPR achieves better accuracy than SNL on two of the three parameters; and it produces substantially better results on all three parameters when the sample size reaches 2000. We also observe from the table that, for sample size 2000, the STDs of the IGPR estimation errors are much smaller than those of SNL, suggesting that the results of IGPR are subject to smaller statistical variations.

The performance can also be compared by visualizing the posterior results. In Fig. 3 we show the marginal posterior distributions computed by IGPR and SNL with 2000 total samples in one trial, where for SNL the distributions are obtained via kernel density estimation from the posterior samples. In the figure, we can see that IGPR produces evidently better results, confirming the findings in Table 2. Next, to demonstrate how the estimation results improve as the iterations proceed in IGPR, we plot in Fig. 4 the average estimation error of IGPR against the number of iterations. It can be seen from the figure that the errors on all three parameters decay markedly as the number of iterations increases, and the results appear to become stable after 5 iterations.

Figure 3.
For the metabolic pathway example. The marginal posterior distributions computed by IGPR and SNL. In the plots, the red solid lines indicate the true parameter values.

Table 1
For the metabolic pathway example. The ground truth and the prior distributions of the model parameters.

parameter   log(α₁)    log(β₁)    log(β₂)
truth       0          0          0
prior       N(−·, ·)   N(−·, ·)   N(−·, ·)

Table 2
For the metabolic pathway example. Means and standard deviations of the estimation errors. The smaller errors of the two methods are shown in bold, and the numbers inside the parentheses are the standard deviations.

Sample size        Method        log α₁    log β₁    log β₂
500 (m = 50)       IGPR / SNL    ·         ·         ·
1000 (m = 100)     IGPR / SNL    ·         ·         ·
2000 (m = 200)     IGPR / SNL    ·         ·         ·

Figure 4.
For the metabolic pathway example. The absolute error with respect to the number of iterations, computed by IGPR and SNL.
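The error metric reported in the tables and in Fig. 4, the average absolute error of the posterior-mean estimate over repeated trials, can be sketched as follows; the synthetic trial results below are illustrative, not the paper's data.

```python
import numpy as np

def average_estimation_error(posterior_means, truth):
    """Per-parameter mean and STD over repeated trials of
    |posterior mean - ground truth|.
    posterior_means has shape (n_trials, n_params)."""
    err = np.abs(np.asarray(posterior_means) - np.asarray(truth))
    return err.mean(axis=0), err.std(axis=0)

# Example: 50 hypothetical trials for 3 parameters with ground truth zero
rng = np.random.default_rng(0)
truth = np.array([0.0, 0.0, 0.0])
trials = truth + 0.1 * rng.normal(size=(50, 3))   # synthetic posterior means
mean_err, std_err = average_estimation_error(trials, truth)
print(mean_err, std_err)
```

The STD column in the tables measures statistical stability: a method with a smaller error STD produces similar estimates across independent repetitions.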
In this section we consider a chaotic ecological system tested in [18]. It is a dynamical system that models adult blowfly populations [30], and interestingly the system can produce chaotic behavior for some parameter settings. Specifically, the discretized dynamical system is

(4.4)  N_{τ+1} = P N_{τ−τ_δ} exp(−N_{τ−τ_δ}/N₀) e_τ + N_τ exp(−δ ε_τ),

where N_τ is the population, e_τ ∼ N(1/σ_p, 1/σ_p) and ε_τ ∼ N(1/σ_d, 1/σ_d) are sources of noise, and τ_δ is an integer. In total there are six model parameters θ̃ = (P, δ, N₀, σ_d, σ_p, τ_δ) that we want to estimate. Like [18], we model the logarithmic values of the parameters, where the prescribed ground truth and the prior distributions are both shown in Table 3. A time-series signal {N_τ} is generated from model (4.4), and the data for inferring the model parameters are again the 16 summary statistics of N_τ as described in Section 4.1.

Table 3
For the blowfly population example. The ground truth and the prior distributions of the model parameters.

parameter   log P     log δ     log N₀    log σ_d    log σ_p    log τ_δ
truth       4         −1.4      6.5       0.25       0.5        2.8
prior       N(2, ·)   N(−·, ·)  N(6, ·)   N(−·, ·)   N(−·, ·)   N(2.·, ·)

Table 4
For the blowfly population example. Means and standard deviations of the estimation errors. The smaller errors of the two methods are shown in bold, and the numbers inside the parentheses are the standard deviations.

Sample size            Method        log P   log δ   log N₀   log σ_d   log σ_p   log τ_δ
1,000 (m = 100)        IGPR / SNL    ·       ·       ·        ·         ·         ·
5,000 (m = 500)        IGPR / SNL    ·       ·       ·        ·         ·         ·
10,000 (m = 1000)      IGPR / SNL    ·       ·       ·        ·         ·         ·
SMC-ABC (>30,000)                    ·       ·       ·        ·         ·         ·

In the numerical experiments, we also fix the number of iterations to be T = 10 and test several different per-iteration sample sizes: m = 100,
500 and 1000, for both IGPR and SNL. In IGPR, ρ is chosen so that, for m = 100, all samples are used to construct the GP model, while for m = 500 and 1000, 200 samples are used. Once again we run both methods 50 times and summarize the average estimation errors and their STDs in Table 4. First, we can see from the table that the estimation errors of both methods are quite small compared to the prior. Second, similar to the first example, SNL seems to perform better for m = 100, and as the sample size increases, IGPR becomes more accurate with respect to the estimation error. More precisely, for m = 1000 (total sample size 10,000), IGPR yields smaller estimation errors on all but one parameter and, more importantly, much smaller STDs for all six parameters, which once again demonstrates that IGPR provides statistically more stable estimates. We also provide in the table the estimation errors of a single SMC-ABC simulation with more than 30,000 samples; these errors are one order of magnitude larger than those of the other two methods. Since SMC-ABC is highly expensive, repeated simulations are not conducted.

We also plot the marginal posterior distributions obtained by both methods with 10,000 samples in Fig. 5 to provide a visual comparison, and the figure shows that both methods actually perform well in this example, with some minor differences across dimensions. In addition to the estimation accuracy, we are also interested in the computational cost of the two methods, and thus we show in Fig. 6 the wall-clock time of the methods, plotted as a function of sample size. One can see from the figures that IGPR is substantially more efficient than SNL, and its cost also increases much more slowly with the sample size. For example, for m = 100 the time cost of SNL is 164 seconds and that of IGPR is 1.5 seconds, while for m = 1000 the costs of SNL and IGPR are 950 and 2.0 seconds respectively.
We note here that the lower computational cost of IGPR is partially due to the use of the local GP model, which employs only a small portion of the total samples. We should emphasize an important trade-off: IGPR can only compute the marginal posterior distributions of the parameters, while SNL can obtain the joint one.

Our last example is the Hodgkin–Huxley (HH) model, which is also used as an application example in [20]. The HH model describes how the electrical potential measured across the cell membrane of a neuron varies over time as a function of the current injected through an electrode. In particular, we use the mathematical model of [21], which consists of five coupled ordinary differential equations and is solved numerically using NEURON [7]. The model equations as well as the parameters are detailed in Appendix B, and our goal is to estimate the 12 model parameters (see Appendix B for further details)

(g_ℓ, ḡ_Na, ḡ_K, ḡ_M, E_ℓ, E_Na, E_K, V_T, k_{βn1}, k_{βn2}, τ_max, σ),

from some measured signal. As is explained in [20], this problem is both mathematically challenging and of practical interest in neuroscience [9, 16]. The data used for inference is the voltage V recorded for 100 ms, giving a time series of 4001 voltage recordings, and as usual the 16 prescribed statistics are extracted and used as the data d̃. The 12 unknown model parameters are redefined as θ = (θ₁, ..., θ₁₂), as shown in Table 5; the ground truth of these parameters and the priors used in this example are also provided in Table 5.
Figure 5.
For the blowfly population example. The marginal posterior distributions computed by IGPR and SNL with 10,000 samples. In the plots, the red solid lines indicate the true parameter values.
Figure 6.
For the blowfly population example. Time cost (in seconds) of SNL and IGPR with respect to different sample sizes: the left figure is on a linear scale and the right one is on a logarithmic scale.
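The local-GP construction that keeps IGPR's cost low can be sketched as follows: fit the GP only on the fraction ρ of training pairs whose simulated outputs lie nearest to the query point. The Euclidean metric and the simple nearest-neighbor rule below are assumptions in the spirit of local GP approximations [12], not necessarily the exact selection rule of the algorithm.

```python
import numpy as np

def local_training_set(D, Theta, d_query, rho=0.25):
    """Select the fraction rho of training pairs whose simulated outputs
    are closest (in Euclidean distance) to the query point d_query."""
    D, Theta = np.asarray(D), np.asarray(Theta)
    dist = np.linalg.norm(D - d_query, axis=1)
    k = max(1, int(np.ceil(rho * len(D))))
    idx = np.argsort(dist)[:k]
    return D[idx], Theta[idx]

# Example: keep 200 of 1000 simulated pairs for a 16-D summary vector
rng = np.random.default_rng(0)
D = rng.normal(size=(1000, 16))       # simulated summary statistics
Theta = rng.normal(size=(1000, 6))    # corresponding parameters
D_loc, Theta_loc = local_training_set(D, Theta, rng.normal(size=16), rho=0.2)
print(D_loc.shape, Theta_loc.shape)
```

Since the cost of GP training scales cubically with the number of training points, reducing 1000 pairs to 200 cuts the dominant cost by roughly two orders of magnitude, consistent with the flat timing curves of IGPR in Fig. 6.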
The setup of our numerical experiments is the same as that in Section 4.4 and is not repeated here. The average estimation errors of IGPR and SNL are compared in Table 6, and in addition we plot the marginal posteriors with 10,000 samples obtained in one trial in Fig. 7. From the table we can see that the errors of IGPR are either comparable to or evidently smaller than those of SNL; in fact, with 10,000 samples there is only one parameter for which IGPR yields a larger (yet still comparable) error than SNL. The marginal distribution plots with 10,000 samples largely show similar behavior, where one can see that the posterior mean of IGPR is evidently closer to the truth for several parameters. Moreover, IGPR also results in smaller error STDs than SNL in all the cases with 5,000 or 10,000 samples, suggesting that IGPR is statistically more stable in this example. Finally, we compare the computational cost of the two methods by plotting their wall-clock times in Fig. 8: the plots show that the computational time of SNL grows significantly faster than that of IGPR, and at sample size 10,000 the time cost of SNL is 1504 seconds while that of IGPR is 12 seconds, indicating a substantial advantage of IGPR in terms of computational efficiency.

Table 5
For the HH model example. The ground truth and the prior distributions of the model parameters.

parameter   θ₁ = log g_ℓ      θ₂ = log ḡ_Na      θ₃ = log ḡ_K      θ₄ = log ḡ_M
truth       −4.60             2.99               1.60              −4.96
prior       U[−·, −·]         U[2.30, 3.·]       U[0.·, ·]         U[−·, −·]

parameter   θ₅ = log(−E_ℓ)    θ₆ = log E_Na      θ₇ = log(−E_K)    θ₈ = log(−V_T)
truth       4.24              3.91               4.60              4.09
prior       U[3.·, ·]         U[3.·, ·]          U[3.91, 5.·]      U[3.40, 4.·]

parameter   θ₉ = log k_{βn1}  θ₁₀ = log k_{βn2}  θ₁₁ = log τ_max   θ₁₂ = log σ
truth       −0.69             3.68               6.90              0
prior       U[−·, −·]         U[3.00, 4.·]       U[6.·, ·]         U[−·, ·]

Table 6
For the HH model example. Means and standard deviations of the estimation errors. The smaller errors of the two methods are shown in bold, and the numbers inside the parentheses are the standard deviations.

Sample size            Method        errors on θ₁, ..., θ₁₂ (STDs in parentheses)
1,000 (m = 100)        IGPR / SNL    ·
5,000 (m = 500)        IGPR / SNL    ·
10,000 (m = 1000)      IGPR / SNL    ·
5. Conclusions.
In conclusion, we have proposed an efficient method for Bayesian inference problems with intractable likelihood functions. The method constructs a GP regression from the output of the underlying simulation model to its input, which leads to an
Figure 7.
For the HH model example. The marginal posterior distributions computed by IGPR and SNL with 10,000 samples. In the plots, the red solid lines indicate the true parameter values.
Figure 8.
For the HH model example. Time cost (in seconds) of SNL and IGPR with respect to different sample sizes: the left figure is on a linear scale and the right one is on a logarithmic scale.

approximation of the marginal posterior distribution. With both mathematical and practical examples, we illustrate that the proposed method can be a competitive tool for approximately computing posterior distributions, especially for problems with stringent time constraints. Finally, we discuss some limitations of the proposed method as well as some issues that need to be addressed in the future. First, the approximate posterior distribution obtained by the method is a weighted Gaussian, which requires that the true posterior distribution not deviate too much from a Gaussian; strongly non-Gaussian posteriors, such as multimodal distributions, may cause problems for the proposed method. Second, only the marginal posteriors can be estimated by the IGPR method, and therefore information contained in the joint distribution, such as correlations between parameters, cannot be obtained. To this end, we believe that multi-output GP methods [5] may provide a viable path toward directly approximating the joint posterior distribution. We plan to address these issues in future studies.
Appendix A. Proof of Theorem 3.2.
We first prove two lemmas. The first lemma considers the convergence with respect to ǫ. For convenience, for a given training set D_ǫ(d̃), we define X_ǫ = {d_i | (d_i, θ_i) ∈ D_ǫ(d̃)} and Y_ǫ = {θ_i | (d_i, θ_i) ∈ D_ǫ(d̃)} to denote the input and output data points in the training set.

Lemma A.1.
Let π_GP(θ | d̃, D_ǫ(d̃)) be the GP model constructed with D_ǫ(d̃), evaluated at d̃ for any d̃ ∈ R^{n_d}, and let µ_mǫ(d̃) and σ_mǫ(d̃) be respectively the mean and standard deviation of π_GP(θ | d̃, D_ǫ(d̃)). Suppose that the kernel function k(·, ·) of the GP model satisfies Assumption 3.1. Then

(A.1)  µ_mǫ(d̃) →^d µ_m(d̃)  as ǫ → 0,

where →^d denotes convergence in distribution.

Proof.
Using Assumption 3.1, it is easy to show that for any υ > 0 there exists ǫ > 0 such that for any d_i, d_j ∈ D_ǫ(d̃),

(A.2)  |k(d̃, d_i) − c| < υ  and  |k(d_i, d_j) − c| < υ,

which implies the pointwise convergence

lim_{ǫ→0} K(d̃, X_ǫ)(K(X_ǫ, X_ǫ) + σ_ζ² I)^{−1} = K(d̃, X)(K(X, X) + σ_ζ² I)^{−1},

thanks to the continuity of matrix inversion. By definition we have Y_ǫ → Y in distribution, and it then follows that

(A.3)  µ_mǫ(d̃) = µ₀(d̃) + K(d̃, X_ǫ)(K(X_ǫ, X_ǫ) + σ_ζ² I)^{−1}(Y_ǫ − µ₀(d̃) 1_m)
              →^d µ₀(d̃) + K(d̃, X)(K(X, X) + σ_ζ² I)^{−1}(Y − µ₀(d̃) 1_m) = µ_m(d̃)  as ǫ → 0.

Lemma A.2.
Let π_GP(θ | d̃, D(d̃)) be the GP model constructed with D(d̃), evaluated at d̃ for any d̃ ∈ R^{n_d}, and let µ_m(d̃) and σ_m(d̃) be respectively the mean and standard deviation of π_GP(θ | d̃, D(d̃)). Suppose that the kernel function k(·, ·) of the GP model satisfies Assumption 3.1. Then

(A.4)  µ_m(d̃) → E_{θ|d̃}[θ]  almost surely as m → ∞.

Proof. As ǫ = 0, we have d_i = d̃ for all d_i ∈ D(d̃), and it follows that K(d̃, X) = c 1ᵀ_m and K(X, X) = c 1_{m×m}, where 1_m is the vector of ones of length m and 1_{m×m} is the m × m matrix of ones. From Eq. (2.8b), we have

(A.5)  µ_m(d̃) = µ₀(d̃) + K(d̃, X)(K(X, X) + νI)^{−1}(Y − µ₀(d̃) 1_m)
             = µ₀(d̃) + c 1ᵀ_m (c 1_{m×m} + νI)^{−1}(Y − µ₀(d̃) 1_m)
             = µ₀(d̃) + c 1ᵀ_m (c 1_m 1ᵀ_m + νI)^{−1}(Y − µ₀(d̃) 1_m),

and applying the Sherman–Morrison formula to the equation above yields

(A.6)  µ_m(d̃) = µ₀(d̃) + c 1ᵀ_m ( ν^{−1} I − (c ν^{−2} 1_m 1ᵀ_m)/(1 + c m ν^{−1}) )(Y − µ₀(d̃) 1_m)
             = µ₀(d̃) + (c ν^{−1})/(1 + c m ν^{−1}) 1ᵀ_m (Y − µ₀(d̃) 1_m)
             = (c ν^{−1})/(1 + c m ν^{−1}) 1ᵀ_m Y + 1/(1 + c m ν^{−1}) µ₀(d̃)
             = (c m ν^{−1})/(1 + c m ν^{−1}) ( (1/m) Σ_{i=1}^m θ_i ) + 1/(1 + c m ν^{−1}) µ₀(d̃),

where {θ_i}_{i=1}^m are i.i.d. samples following the distribution π(θ | d̃). Thus, by the strong law of large numbers,

(A.7)  µ_m(d̃) → E_{θ|d̃}[θ]  almost surely as m → ∞.

Theorem 3.2 is a direct consequence of Lemmas A.1 and A.2.
Appendix B. Description of the HH model.
In this section we describe the equations of the HH model. The main membrane equation is

(B.1)  C_m dV/dt = −I_l − I_Na − I_K − I_M − I_e,

where C_m = 1, and we need to specify the terms I_l, I_Na, I_K, I_M and I_e in the equation. First, I_l has the form (with parameters g_l and E_l)

(B.2)  I_l = g_l (V − E_l).

Next, the equations for I_Na, with parameters ḡ_Na, E_Na and V_T, are

I_Na = ḡ_Na m³ h (V − E_Na),
dm/dt = α_m(V)(1 − m) − β_m(V) m,
dh/dt = α_h(V)(1 − h) − β_h(V) h,
α_m = −0.32 (V − V_T − 13) / (exp[−(V − V_T − 13)/4] − 1),
β_m = 0.28 (V − V_T − 40) / (exp[(V − V_T − 40)/5] − 1),
α_h = 0.128 exp[−(V − V_T − 17)/18],
β_h = 4 / (1 + exp[−(V − V_T − 40)/5]).

The equations for I_K depend on four parameters, ḡ_K, E_K, k_{βn1} and k_{βn2}, and are given by

I_K = ḡ_K n⁴ (V − E_K),
dn/dt = α_n(V)(1 − n) − β_n(V) n,
α_n = −0.032 (V − V_T − 15) / (exp[−(V − V_T − 15)/5] − 1),
β_n = k_{βn1} exp[−(V − V_T − 10)/k_{βn2}].

The term I_M, which depends on the two parameters ḡ_M and τ_max, is given by

I_M = ḡ_M p (V − E_K),
dp/dt = (p_∞(V) − p)/τ_p(V),
p_∞(V) = 1 / (1 + exp[−(V + 35)/10]),
τ_p(V) = τ_max / (3.3 exp[(V + 35)/20] + exp[−(V + 35)/20]).

Finally, I_e is a random injected current, I_e ∼ N(·, σ²), with σ a model parameter. The initial condition is set to be

(B.3)  m(0) = h(0) = n(0) = p(0) = 0 and V(0) = −·.

The physical meanings of the variables and parameters can be found in [21].
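As a quick numerical illustration of the gating kinetics above, the following sketch evaluates the steady-state Na⁺ activation m_∞(V) = α_m/(α_m + β_m). The rate constants follow the standard model of [21], and the value of V_T used here is an illustrative assumption.

```python
import math

# Rate functions alpha_m, beta_m for the Na+ activation gate m,
# with the standard constants of the Pospischil et al. model;
# V_T shifts the spike threshold.
def alpha_m(V, VT):
    x = V - VT - 13.0
    return -0.32 * x / (math.exp(-x / 4.0) - 1.0)

def beta_m(V, VT):
    x = V - VT - 40.0
    return 0.28 * x / (math.exp(x / 5.0) - 1.0)

def m_inf(V, VT=-60.0):
    """Steady state of dm/dt = alpha_m (1 - m) - beta_m m."""
    a, b = alpha_m(V, VT), beta_m(V, VT)
    return a / (a + b)

# m_inf is sigmoidal in V: near 0 well below threshold and
# approaching 1 for strongly depolarized membrane potentials.
for V in (-80.0, -40.0, 0.0):
    print(V, m_inf(V))
```

Setting dm/dt = 0 in the gating equation gives m_∞ = α_m/(α_m + β_m), which is what the sketch evaluates; the h, n, and p gates behave analogously with their own rate functions.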
REFERENCES

[1] L. Acerbi, Variational Bayesian Monte Carlo, in Advances in Neural Information Processing Systems, 2018, pp. 8213–8223.
[2] C. Andrieu, N. De Freitas, A. Doucet, and M. I. Jordan, An introduction to MCMC for machine learning, Machine Learning, 50 (2003), pp. 5–43.
[3] M. A. Beaumont, W. Zhang, and D. J. Balding, Approximate Bayesian computation in population genetics, Genetics, 162 (2002), pp. 2025–2035.
[4] D. M. Blei, A. Kucukelbir, and J. D. McAuliffe, Variational inference: A review for statisticians, Journal of the American Statistical Association, 112 (2017), pp. 859–877.
[5] P. Boyle and M. Frean, Dependent Gaussian processes, Advances in Neural Information Processing Systems, 17 (2004), pp. 217–224.
[6] S. Brooks, A. Gelman, G. Jones, and X.-L. Meng, Handbook of Markov Chain Monte Carlo, CRC Press, 2011.
[7] N. T. Carnevale and M. L. Hines, The NEURON Book, Cambridge University Press, 2006.
[8] I.-C. Chou and E. O. Voit, Recent developments in parameter estimation and structure identification of biochemical and genomic systems, Mathematical Biosciences, 219 (2009), pp. 57–83.
[9] A. C. Daly, D. J. Gavaghan, C. Holmes, and J. Cooper, Hodgkin–Huxley revisited: reparametrization and identifiability analysis of the classic action potential model with approximate Bayesian methods, Royal Society Open Science, 2 (2015), p. 150499.
[10] C. W. Fox and S. J. Roberts, A tutorial on variational Bayesian inference, Artificial Intelligence Review, 38 (2012), pp. 85–95.
[11] A. Gelman, J. B. Carlin, H. S. Stern, D. Dunson, A. Vehtari, and D. B. Rubin, Bayesian Data Analysis (3rd edition), Chapman & Hall/CRC, 2013.
[12] R. B. Gramacy and D. W. Apley, Local Gaussian process approximation for large computer experiments, Journal of Computational and Graphical Statistics, 24 (2015), pp. 561–578.
[13] K. Kandasamy, J. Schneider, and B. Poczos, Bayesian active learning for posterior estimation, in Proceedings of the 24th International Conference on Artificial Intelligence, AAAI Press, 2015, pp. 3605–3611.
[14] M. C. Kennedy and A. O'Hagan, Bayesian calibration of computer models, Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63 (2001), pp. 425–464.
[15] G. Lillacci and M. Khammash, Parameter estimation and model selection in computational biology, PLoS Computational Biology, 6 (2010), p. e1000696.
[16] J.-M. Lueckmann, P. J. Goncalves, G. Bassetto, K. Öcal, M. Nonnenmacher, and J. H. Macke, Flexible statistical inference for mechanistic models of neural dynamics, in Advances in Neural Information Processing Systems, 2017, pp. 1289–1299.
[17] P. Marjoram, J. Molitor, V. Plagnol, and S. Tavaré, Markov chain Monte Carlo without likelihoods, Proceedings of the National Academy of Sciences, 100 (2003), pp. 15324–15328.
[18] E. Meeds and M. Welling, GPS-ABC: Gaussian process surrogate approximate Bayesian computation, in Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence, UAI'14, Arlington, Virginia, USA, 2014, AUAI Press, pp. 593–602.
[19] G. Papamakarios and I. Murray, Fast ε-free inference of simulation models with Bayesian conditional density estimation, in Proceedings of the 30th International Conference on Neural Information Processing Systems, 2016, pp. 1036–1044.
[20] G. Papamakarios, D. Sterratt, and I. Murray, Sequential neural likelihood: Fast likelihood-free inference with autoregressive flows, in The 22nd International Conference on Artificial Intelligence and Statistics, PMLR, 2019, pp. 837–848.
[21] M. Pospischil, M. Toledo-Rodriguez, C. Monier, Z. Piwkowska, T. Bal, Y. Frégnac, H. Markram, and A. Destexhe, Minimal Hodgkin–Huxley type models for different classes of cortical and thalamic neurons, Biological Cybernetics, 99 (2008), pp. 427–441.
[22] J. Sacks, W. J. Welch, T. J. Mitchell, and H. P. Wynn, Design and analysis of computer experiments, Statistical Science, (1989), pp. 409–423.
[23] S. A. Sisson, Y. Fan, and M. Beaumont, Handbook of Approximate Bayesian Computation, Chapman and Hall/CRC, 2018.
[24] S. A. Sisson, Y. Fan, and M. M. Tanaka, Sequential Monte Carlo without likelihoods, Proceedings of the National Academy of Sciences, 104 (2007), pp. 1760–1765.
[25] O. Thomas, R. Dutta, J. Corander, S. Kaski, M. U. Gutmann, et al., Likelihood-free inference by ratio estimation, Bayesian Analysis, (2020).
[26] E. O. Voit and J. Almeida, Decoupling dynamical systems for pathway identification from metabolic profiles, Bioinformatics, 20 (2004), pp. 1670–1681.
[27] H. Wang and J. Li, Adaptive Gaussian process approximation for Bayesian inference with expensive likelihood functions, Neural Computation, 30 (2018), pp. 3072–3094.
[28] C. K. Williams and C. E. Rasmussen, Gaussian Processes for Machine Learning, MIT Press, 2006.
[29] S. N. Wood, Statistical inference for noisy nonlinear ecological dynamic systems, Nature, 466 (2010), pp. 1102–1104.
[30] S. N. Wood, Statistical inference for noisy nonlinear ecological dynamic systems, Nature, 466 (2010), pp. 1102–1104.