Approximate Inference for Observation Driven Time Series Models with Intractable Likelihoods
BY AJAY JASRA, NIKOLAS KANTAS & ELENA EHRLICH

Department of Statistics & Applied Probability, National University of Singapore, Singapore, 117546, SG. E-Mail: [email protected]
Department of Statistical Science, University College London, London, WC1E 6BT, UK. E-Mail: [email protected]
Department of Mathematics, Imperial College London, London, SW7 2AZ, UK. E-Mail: [email protected]
Abstract
In the following article we consider approximate Bayesian parameter inference for observation driven time series models. Such statistical models appear in a wide variety of applications, including econometrics and applied mathematics. This article considers the scenario where the likelihood function cannot be evaluated point-wise; in such cases, one cannot perform exact statistical inference, including parameter estimation, which often requires advanced computational algorithms, such as Markov chain Monte Carlo (MCMC). We introduce a new approximation based upon approximate Bayesian computation (ABC). Under some conditions, we show that as n → ∞, with n the length of the time series, the ABC posterior has, almost surely, a maximum a posteriori (MAP) estimator of the parameters which is different from the true parameter. However, a noisy ABC MAP, which perturbs the original data, asymptotically converges to the true parameter, almost surely. In order to draw statistical inference, for the ABC approximation adopted, standard MCMC algorithms can have acceptance probabilities that fall at an exponential rate in n, and slightly more advanced algorithms can mix poorly. We develop a new and improved MCMC kernel, which is based upon an exact approximation of a marginal algorithm, whose cost per iteration is random but whose expected cost, for good performance, is shown to be O(n) per iteration. We implement our new MCMC kernel for parameter inference from models in econometrics.

Key Words:
Observation Driven Time Series Models, Approximate Bayesian Computation, Asymptotic Consistency, Markov Chain Monte Carlo.
1 Introduction

Observation driven time series models, introduced by [4], have a wide variety of real applications, including econometrics (GARCH models) and applied mathematics (inferring initial conditions and parameters of ordinary differential equations). The model can be described as follows. We observe {Y_k}_{k∈N}, Y_k ∈ Y, which are associated to a dynamic system {X_k}_{k∈N}, X_k ∈ X, which is potentially unknown. Define the process {Y_k, X_k}_{k≥0} (with y_0 some arbitrary point on Y) on a probability space (Ω, F, P_θ), where, for every θ ∈ Θ ⊆ R^{d_θ}, P_θ is a probability measure. Denote by F_k = σ({Y_n, X_n}_{0≤n≤k}). The model is defined as, for k ∈ N:

P_θ(Y_{k+1} ∈ A | F_k) = ∫_A H_θ(x_k, dy), A ∈ σ(Y),
X_{k+1} = Φ_θ(X_k, Y_{k+1}),
P_θ(X_0 ∈ B) = ∫_B Π_θ(dx_0),

where H : Θ × X × σ(Y) → [0, 1], Φ : Θ × X × Y → X and, for every θ ∈ Θ, Π_θ ∈ P(X) (the probability measures on X). Throughout, we assume that for any (x, θ) ∈ X × Θ, H_θ(x, ·) admits a density w.r.t. some σ-finite measure μ, which we denote as h_θ(x, y). Next, we define a prior probability distribution Ξ on (Θ, B(Θ)), with Lebesgue density ξ. Thus, given n observations y_{1:n} := (y_1, ..., y_n), the object of inference is the posterior distribution on Θ × X:

Π(d(θ, x_0) | y_{1:n}) ∝ ( ∏_{k=1}^n h_θ(Φ_θ(y_{1:k−1})(x_0), y_k) ) Π_θ(dx_0) ξ(θ) dθ,    (1)

where we have used the notation Φ_θ(y_{1:k−1})(x_0) = Φ_θ(·, y_{k−1}) ∘ ··· ∘ Φ_θ(·, y_1)(x_0) and dθ is Lebesgue measure. In most applications of practical interest, one cannot compute the posterior point-wise and has to resort to numerical methods, such as MCMC, to draw inference on θ and/or x_0.

In this article, we are not only interested in inferring the posterior distribution, but the scenario for which h_θ(x, y) cannot be evaluated point-wise, nor do we have access to an unbiased estimate of it (it is assumed we can simulate from the associated distribution).
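As a concrete illustration of the generative recursion, simulation can be sketched as follows. The specific choices of H_θ and Φ_θ below (a centred Gaussian observation density with variance x_k, and a GARCH(1,1)-type state map) are hypothetical, chosen for exposition only; note that only the ability to simulate Y_{k+1} given x_k is needed, never an evaluation of h_θ.

```python
import math
import random

def simulate(theta, n, x0=1.0, seed=0):
    """Simulate an observation driven time series.

    Illustrative instance (an assumption, not the paper's model):
    Y_{k+1} | x_k ~ N(0, x_k), i.e. H_theta(x_k, dy) is a centred
    Gaussian with variance x_k, and the deterministic state update is
    the GARCH(1,1)-type map Phi_theta(x, y) = b0 + b1*x + b2*y**2.
    """
    b0, b1, b2 = theta
    rng = random.Random(seed)
    x, xs, ys = x0, [x0], []
    for _ in range(n):
        y = rng.gauss(0.0, math.sqrt(x))  # Y_{k+1} ~ H(x_k, .)
        x = b0 + b1 * x + b2 * y ** 2     # X_{k+1} = Phi_theta(X_k, Y_{k+1})
        ys.append(y)
        xs.append(x)
    return ys, xs

ys, xs = simulate((0.1, 0.8, 0.1), n=50)
```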
In such a case, it is not possible to draw inference from the true posterior, even using numerical techniques. The common response in Bayesian statistics is to adopt an approximation of the posterior using the notion of approximate Bayesian computation (ABC); see [13] for a recent overview. ABC approximations of posteriors are based upon defining a probability distribution on an extended state-space, with the additional random variables lying on the data-space and usually distributed according to the true likelihood. The closeness of the ABC posterior to the true posterior is controlled by a tolerance parameter ε > 0, with the approximation improving as ε → 0. The new ABC approximation that we develop is studied from a theoretical perspective. Relying on the recent work of [6], we show that, under some conditions, as n → ∞, with n the length of the time series, the ABC posterior has, almost surely, a MAP estimator of θ which is different from the true parameter, θ* say. However, a noisy ABC MAP of θ asymptotically converges to the true parameter, almost surely. These results establish that the particular approximation adopted is reasonably sensible.

The other main contribution of this article is the development of a new MCMC algorithm designed to sample from the ABC approximation of the posterior. Due to the nature of the ABC approximation, it is easily seen that standard MCMC algorithms (e.g. [12]) will have an acceptance probability that falls at an exponential rate in n. In addition, more advanced ideas, such as those based upon the 'pseudo-marginal' approach [3], have recently been shown to perform rather poorly in theory; see [11]. These latter algorithms are based upon exact approximations of marginal algorithms [1, 2], which in our context just sample θ, x_0. We develop an MCMC kernel, related to recent work in [10], which is designed to have a random running time per iteration, with the idea of improving the exploration ability of the Markov chain.
We show that the expected cost per iteration of the algorithm, under some assumptions and for reasonable performance, is O(n), which compares favourably with competing algorithms. We also show, empirically, that this new MCMC method out-performs standard pseudo-marginal algorithms.

This paper is structured as follows. In Section 2 we introduce our ABC approximation and give our theoretical results on the MAP estimator. In Section 3, we give our new MCMC algorithm, along with some theoretical discussion about its computational cost and stability. In Section 4 our approximation and MCMC algorithm are illustrated on toy and real examples. In Section 5 we conclude the article with some discussion of future work. The proofs of our theoretical results are given in the appendix.

2 The ABC Approximation

As was emphasised in Section 1, we are interested in performing inference when h_θ(x, y) cannot be evaluated point-wise, nor do we have access to an unbiased estimate of it. We will instead assume it is possible to sample from h_θ. In such scenarios, one cannot use standard simulation based methods. For example, in a standard MCMC approach the Metropolis-Hastings acceptance ratio cannot be evaluated, even though it may be well-defined. Following the work in [5, 8] for hidden Markov models, we introduce an ABC approximation for the density of the posterior in (1) as follows:

π^ε_n(θ, x_0 | y_{1:n}) ∝ ∏_{k=1}^n h_{θ,ε}(Φ_θ(y_{1:k−1})(x_0), y_k) ξ(x_0, θ),    (2)

with ε > 0 and

h_{θ,ε}(Φ_θ(y_{1:k−1})(x_0), y_k) = ∫_{B_ε(y_k)} h_θ(Φ_θ(y_{1:k−1})(x_0), y) μ(dy) / μ(B_ε(0)),    (3)

where we denote by B_ε(y) the open ball centred at y with radius ε and write μ(B_ε(y)) = ∫_{B_ε(y)} μ(dx). When μ is the Lebesgue measure, μ(B_ε(y)) corresponds to the volume of the ball B_ε(y). In general, we will refer to ABC as the procedure of performing inference for the posterior in (2).
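In one dimension, with μ the Lebesgue measure (so μ(B_ε(0)) = 2ε), the smoothed term (3) only requires draws from h_θ: it can be estimated by the fraction of simulated pseudo-observations landing in the ball. A minimal sketch, where the Gaussian h_θ used in the check is an illustrative assumption:

```python
import math
import random

def h_eps_mc(simulate_y, y_obs, eps, M=100_000, seed=1):
    """Monte Carlo estimate of h_{theta,eps}(x, y_obs)
    = P(Y in B_eps(y_obs)) / (2*eps), using M draws Y ~ h_theta(x, .)."""
    rng = random.Random(seed)
    hits = sum(abs(simulate_y(rng) - y_obs) < eps for _ in range(M))
    return hits / (M * 2.0 * eps)

# Check against an exact Gaussian computation: if Y ~ N(mu, 1) then
# P(Y in B_eps(y)) = F(y + eps - mu) - F(y - eps - mu), F the normal CDF.
mu, y, eps = 0.3, 0.5, 0.2
F = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
exact = (F(y + eps - mu) - F(y - eps - mu)) / (2.0 * eps)
approx = h_eps_mc(lambda rng: rng.gauss(mu, 1.0), y, eps)
```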
In addition, we will call noisy ABC the inference procedure that uses, instead of the original observation sequence, a perturbed one, namely {Ŷ_k}_{k≥1}, where each Ŷ_k is given by Ŷ_k = Y_k + εZ_k, with the Z_k independently and identically distributed (i.i.d.) uniformly on B_1(0) (shorthand Z_k ∼ U_{B_1(0)}).

2.1 Asymptotic properties of the ABC MAP estimator

In this section we will investigate some interesting properties of the ABC posterior in (2). In particular, we will look at the asymptotic behaviour, with n, of the resulting MAP estimators for θ. The properties of the MAP estimator reveal information about the mode of the posterior distribution as we obtain increasingly more data. To simplify the analysis in this section we will assume that:

(A1)
– x_0 is fixed and known, i.e. Π_θ(dx_0) = δ_{x_0}(dx_0), where δ_{x_0} denotes the Dirac delta measure on X and x_0 ∈ X is known.
– ξ(x_0, ·) is bounded and positive everywhere in Θ.
– the observations actually originated from the true model for some θ* ∈ Θ, i.e. we look at a well-specified problem.
– H and h do not depend upon θ.

Thus we have the following model recursions for the true model:

P_{θ*}(Y_{k+1} ∈ A | F_k) = ∫_A H(x_k, dy), A ∈ σ(Y),
X_{k+1} = Φ_{θ*}(X_k, Y_{k+1}),    (4)

where we will denote expectations associated to P_{θ*} as E_{θ*}.

In addition, for this section we will introduce some extra notation: (X, d) is a compact, complete and separable metric space and (Θ, d) is a compact metric space, with Θ ⊂ R^{d_θ}. For two measures ν and λ of bounded variation, denote the convolution ν ⋆ λ(f) = ∫ f(z + r) ν(dz) λ(dr). Let also Q_ε be the probability law associated to the random sequence {εZ_k}_{k∈Z}, where each Z_k is an i.i.d.
sample from the uniform distribution defined on B_1(0). We proceed with some additional technical assumptions:

(A2) {X_k, Y_k}_{k∈Z} is a stationary stochastic process, with {Y_k}_{k∈Z} strict sense stationary and ergodic, following (4).

(A3) For every (x, y) ∈ X × Y, θ ↦ Φ_θ(x, y) is continuous. In addition, there exists 0 < C < ∞ such that for any (x, x′) ∈ X², sup_{y∈Y} |h(x, y) − h(x′, y)| ≤ C d(x, x′). Finally, 0 < h̲ ≤ h(x, y) ≤ h̄ < ∞, for every (x, y) ∈ X × Y.

(A4) There exists a measurable ϱ : Y → (0, 1) such that for every (θ, y, x, x′) ∈ Θ × Y × X²:

d(Φ_θ(x, y), Φ_θ(x′, y)) ≤ ϱ(y) d(x, x′).

(A5) The following statements hold:
1. H(x, ·) = H(x′, ·) if and only if x = x′.
2. If Φ_θ(Y_{−∞:0})(x_0) = Φ_{θ′}(Y_{−∞:0})(x_0) holds P_{θ*} ⋆ Q_ε − a.s., then θ = θ′.

Assumptions (A2-5) and the compactness of Θ are standard assumptions for maximum likelihood (ML) estimation, and they can be used to show the uniqueness of the maximum likelihood estimator (MLE); see [6] for more details. Therefore, if the prior ξ(θ) is bounded and positive everywhere on Θ, it is a simple corollary that the MAP estimator will correspond to the MLE. In the remaining part of this section we will adapt the analysis in [6] for the MLE to the ABC setup.

In particular, we are to estimate θ using the log-likelihood function:

l_{θ,x_0}(y_{1:n}) := (1/n) ∑_{k=1}^n log( h_ε(Φ_θ(y_{1:k−1})(x_0), y_k) ).

We define the ABC-MLE for an n-long sequence as

θ_{n,x_0,ε} = arg max_{θ∈Θ} l_{θ,x_0}(y_{1:n}).

We proceed with the following proposition:

Proposition 2.1.
Assume (A1-4). Then for every x_0 ∈ X and fixed ε > 0:

lim_{n→∞} d(θ_{n,x_0,ε}, Θ*_ε) = 0, P_{θ*} − a.s.,

where

Θ*_ε = arg max_{θ∈Θ} E_{θ*}[ log( h_ε(Φ_θ(Y_{−∞:0})(x_0), Y_1) ) ].

The result establishes that the estimate will converge to a point (or set of points) which is typically different from the true parameter. Hence there is an intrinsic asymptotic bias for the plain ABC procedure. To correct this bias, consider the noisy ABC procedure of replacing the observations by Ŷ_k = Y_k + εZ_k, where Z_k i.i.d. ∼ U_{B_1(0)}. The noisy ABC MLE estimator is then:

θ̂_{n,x_0,ε} = arg max_{θ∈Θ} (1/n) ∑_{k=1}^n log( h_ε(Φ_θ(ŷ_{1:k−1})(x_0), ŷ_k) ).

We have the following result:
Proposition 2.2.
Assume (A1-5). Then for every x_0 ∈ X and fixed ε > 0:

lim_{n→∞} d(θ̂_{n,x_0,ε}, θ*) = 0, P_{θ*} ⋆ Q_ε − a.s.

The result shows that the noisy ABC MLE estimator is asymptotically unbiased. Therefore, given that in our setup the ABC MAP estimator corresponds to the ABC MLE, we can conclude that the mode of the posterior distribution, as we obtain increasingly more data, converges towards the true parameter. Finally, we note that our assumptions do pose some restrictions, but these are shown to be realistic for a few interesting models in [6]. In addition, the main purpose of this result is to motivate the use of the approximate posterior in (2) when the observation sequence is long or its marginal likelihood is quite informative.
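The noisy ABC data perturbation is straightforward to implement: in one dimension, U_{B_1(0)} is simply the uniform distribution on (−1, 1). A minimal sketch:

```python
import random

def perturb(ys, eps, seed=2):
    """Noisy ABC: replace each y_k by y_k + eps * z_k with
    z_k i.i.d. uniform on B_1(0), i.e. on (-1, 1) in one dimension."""
    rng = random.Random(seed)
    return [y + eps * rng.uniform(-1.0, 1.0) for y in ys]

ys = [0.0, 1.5, -2.3]
ys_hat = perturb(ys, eps=0.1)
```

Each perturbed point lies within ε of the original, and inference then proceeds exactly as before, with ŷ_{1:n} in place of y_{1:n}.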
3 MCMC for the ABC Posterior

One can rewrite the approximate posterior in (2) as:

π^ε(θ, x_0 | y_{1:n}) = p^ε_{θ,x_0}(y_{1:n}) ξ(x_0, θ) / ∫ p^ε_{θ,x_0}(y_{1:n}) ξ(x_0, θ) dx_0 dθ,

with

p^ε_{θ,x_0}(y_{1:n}) = ∫ ∏_{k=1}^n [ I_{B_ε(y_k)}(u_k) / μ(B_ε(0)) ] h_θ(Φ_θ(y_{1:k−1})(x_0), u_k) du_{1:n}.

Note we have just used Fubini's theorem to rewrite the likelihood p^ε_{θ,x_0}(y_{1:n}) as an integral of a product, instead of the product of integrals ∏_{k=1}^n h_{θ,ε}(Φ_θ(y_{1:k−1})(x_0), y_k) shown in (2)-(3). In this paper we will focus only on MCMC algorithms and, in particular, on the Metropolis-Hastings (M-H) approach. In order to sample from the posterior π^ε, one runs an ergodic Markov chain with invariant density π^ε. Then, after a burn-in period, once the chain has (approximately) reached stationarity, one can treat the samples from the chain as approximate samples from π^ε. This is shown in Algorithm 1, where for convenience we denote γ = (θ, x_0). The one-step transition kernel of the MCMC chain is usually described as the M-H kernel and follows from Step 2 in Algorithm 1.

Unfortunately, p^ε_{θ,x_0}(y_{1:n}) is not available analytically and cannot be evaluated, so this rules out the possibility of using traditional MCMC approaches like Algorithm 1. However, one can resort to the so-called pseudo-marginal approach, whereby unbiased estimates of p^ε_{θ,x_0}(y_{1:n}) are used instead within an MCMC algorithm. We will refer to this algorithm as ABC-MCMC. The resulting algorithm can be posed as one targeting a posterior defined on an extended state-space, so that its marginal coincides with π^ε(θ, x_0 | y_{1:n}).
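In generic form, the pseudo-marginal mechanism replaces the intractable likelihood in the acceptance ratio of Algorithm 1 by an unbiased estimate, and recycles the current estimate between iterations. A minimal sketch with a symmetric random-walk proposal; the estimator `p_hat`, the flat prior and the toy Gaussian target are hypothetical placeholders, not the paper's model:

```python
import math
import random

def pseudo_marginal_mh(p_hat, prior, propose, gamma0, iters, seed=3):
    """Pseudo-marginal M-H with a symmetric proposal: accept with
    probability min(1, (p_hat' * prior') / (p_hat * prior)); storing the
    current estimate keeps the extended chain exact for the marginal."""
    rng = random.Random(seed)
    gamma, z = gamma0, p_hat(gamma0, rng)
    chain = []
    for _ in range(iters):
        prop = propose(gamma, rng)
        z_prop = p_hat(prop, rng)
        accept_prob = (z_prop * prior(prop)) / max(z * prior(gamma), 1e-300)
        if rng.random() < accept_prob:
            gamma, z = prop, z_prop  # recycle the accepted estimate
        chain.append(gamma)
    return chain

# Toy usage: N(0, 1) target with a noisy but unbiased "likelihood" estimate.
chain = pseudo_marginal_mh(
    p_hat=lambda g, rng: math.exp(-0.5 * g * g) * rng.uniform(0.5, 1.5),
    prior=lambda g: 1.0,                         # flat prior, for illustration
    propose=lambda g, rng: g + rng.gauss(0.0, 1.0),
    gamma0=0.0, iters=500)
```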
We will use these ideas to present ABC-MCMC as an M-H algorithm which is an exact approximation of an appropriate marginal algorithm. To illustrate these ideas, we proceed by writing a posterior on an extended state-space Θ × X × Y^n as follows:

π^ε_n(θ, x_0, u_{1:n} | y_{1:n}) ∝ ∏_{k=1}^n I_{B_ε(y_k)}(u_k) h_θ(Φ_θ(y_{1:k−1})(x_0), u_k) ξ(x_0, θ).    (5)

It is clear that (2) is the marginal of (5), hence the similarity in the notation. As we will show later in this section, extending the target space in the posterior as in (5) is not the only, and certainly not the best, choice. We emphasise that the only essential requirement for each choice is that the marginal of the extended target is π^ε(θ, x_0 | y_{1:n}), but one should be cautious, because the particular choice will affect the mixing properties and the efficiency of the MCMC scheme that will be used to sample from π^ε_n(θ, x_0, u_{1:n} | y_{1:n}) in (5) or another variant.

Algorithm 1 A marginal M-H algorithm for π^ε(γ | y_{1:n})
1. (Initialisation) At t = 0 sample γ_0 ∼ ξ.
2. (M-H kernel) For t ≥ 1:
• Sample γ′ | γ_{t−1} from a proposal Q(γ_{t−1}, ·) with density q(γ_{t−1}, ·).
• Accept the proposed state and set γ_t = γ′ with probability

1 ∧ [ p^ε_{γ′}(y_{1:n}) / p^ε_{γ_{t−1}}(y_{1:n}) ] × [ ξ(γ′) q(γ′, γ_{t−1}) / ( ξ(γ_{t−1}) q(γ_{t−1}, γ′) ) ],

otherwise set γ_t = γ_{t−1}. Set t = t + 1 and return to the start of 2.

Algorithm 2
M-H Proposal for basic ABC-MCMC
• Sample γ′ | γ from a proposal Q(γ, ·) with density q(γ, ·).
• Sample u′_{1:n} from the distribution with joint density ∏_{k=1}^n h_{θ′}(Φ_{θ′}(y_{1:k−1})(x′_0), u′_k).
• Accept the proposed state (γ′, u′_{1:n}) with probability:

1 ∧ [ ∏_{k=1}^n I_{B_ε(y_k)}(u′_k) / ∏_{k=1}^n I_{B_ε(y_k)}(u_k) ] × [ ξ(γ′) q(γ′, γ) / ( ξ(γ) q(γ, γ′) ) ].

We will now look at two basic choices for extending the ABC posterior while keeping the marginal fixed to π^ε(θ, x_0 | y_{1:n}). In the remainder of the paper we will denote γ = (θ, x_0), as we did in Algorithm 1. Initially, consider the ABC approximation extended to the space Θ × X × Y^n:

π^ε(γ, u_{1:n} | y_{1:n}) = [ ξ(γ) p^ε_γ(y_{1:n}) / ∫ ξ(γ) p^ε_γ(y_{1:n}) dγ ] × [ ∏_{k=1}^n I_{B_ε(y_k)}(u_k) / ( μ(B_ε(0))^n p^ε_γ(y_{1:n}) ) ] × ∏_{k=1}^n h_θ(Φ_θ(y_{1:k−1})(x_0), u_k).

Recall that one cannot evaluate h_θ(Φ_θ(y_{1:k−1})(x_0), u_k) and is only able to simulate from it. In Algorithm 2 we present a natural M-H proposal that could be used to sample from π^ε(γ, u_{1:n} | y_{1:n}), instead of the one shown in Step 2 of Algorithm 1. Note that this time the state of the MCMC chain is composed of (γ, u_{1:n}). Here each u_k assumes the role of an auxiliary variable, to be eventually integrated out at the end of the MCMC procedure.

However, as n increases, the M-H kernel in Algorithm 2 will have an acceptance probability that falls quickly with n. In particular, for any fixed γ, the probability that every proposed u′_k lies in its ball B_ε(y_k) will fall at an exponential rate in n. This means that this basic ABC-MCMC approach will be inefficient for even a moderate value of n. This issue can be dealt with by using N multiple trials, so that at each k, some auxiliary variables (or pseudo-observations) are in the ball B_ε(y_k).
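With N trials per observation, the acceptance ratio involves, for each k, the fraction of the N pseudo-observations hitting the ball B_ε(y_k); when some α_k is small, the product of these fractions is frequently zero, which is what makes the basic kernels sticky. A sketch (one-dimensional, with Gaussian pseudo-observations as an illustrative assumption):

```python
import random

def hit_fractions(means, ys, eps, N, seed=4):
    """For each k, draw N pseudo-observations u_k^j ~ N(means[k], 1) and
    return the fractions (1/N) * sum_j I_{B_eps(y_k)}(u_k^j) that enter
    the acceptance ratio of the N-trial kernel."""
    rng = random.Random(seed)
    fracs = []
    for m, y in zip(means, ys):
        hits = sum(abs(rng.gauss(m, 1.0) - y) < eps for _ in range(N))
        fracs.append(hits / N)
    return fracs

fracs = hit_fractions(means=[0.0, 0.2], ys=[0.1, -0.1], eps=0.5, N=100)
prod = 1.0
for f in fracs:
    prod *= f  # the product appearing in the acceptance probability
```

If any fraction is zero, the proposal is rejected outright, and the chance of this grows quickly with n when the hit probabilities are small.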
This idea originates from [3, 12] and in fact augments the posterior to a larger state-space, Θ × X × Y^{nN}, in order to target the following density:

π̃^ε(γ, u^{1:N}_{1:n} | y_{1:n}) = [ ξ(γ) p^ε_γ(y_{1:n}) / ∫ ξ(γ) p^ε_γ(y_{1:n}) dγ ] × [ ∏_{k=1}^n N^{−1} ∑_{j=1}^N I_{B_ε(y_k)}(u^j_k) / ( μ(B_ε(0))^n p^ε_γ(y_{1:n}) ) ] × ∏_{k=1}^n ∏_{j=1}^N h_θ(Φ_θ(y_{1:k−1})(x_0), u^j_k).

Again, it is easy to show that the marginal of interest π^ε(γ | y_{1:n}) is preserved, i.e.

π^ε(γ | y_{1:n}) = ∫_{Y^{nN}} π̃^ε(γ, u^{1:N}_{1:n} | y_{1:n}) du^{1:N}_{1:n} = ∫_{Y^n} π^ε(γ, u_{1:n} | y_{1:n}) du_{1:n}.

In Algorithm 3 we present an M-H kernel with invariant density π̃^ε. The state of the MCMC chain is now (γ, u^{1:N}_{1:n}). We remark that as N grows, one expects to recover the properties of the ideal M-H algorithm in Algorithm 1.

Algorithm 3 M-H Proposal for ABC with N trials
• Sample γ′ | γ from a proposal Q(γ, ·) with density q(γ, ·).
• Sample u′^{1:N}_{1:n} from the distribution with joint density ∏_{k=1}^n ∏_{j=1}^N h_{θ′}(Φ_{θ′}(y_{1:k−1})(x′_0), u′^j_k).
• Accept the proposed state (γ′, u′^{1:N}_{1:n}) with probability:

1 ∧ [ ∏_{k=1}^n ( N^{−1} ∑_{j=1}^N I_{B_ε(y_k)}(u′^j_k) ) / ∏_{k=1}^n ( N^{−1} ∑_{j=1}^N I_{B_ε(y_k)}(u^j_k) ) ] × [ ξ(γ′) q(γ′, γ) / ( ξ(γ) q(γ, γ′) ) ].

Nevertheless, it has been shown in [11] that even the M-H kernel in Algorithm 3 does not always perform well. It can happen that the chain often gets stuck in regions of the state-space Θ × X where

α_k(y_k, ε, γ) := ∫_{B_ε(y_k)} h_θ(Φ_θ(y_{1:k−1})(x_0), u) du

is small. We will address this shortfall by proposing an alternative augmented target and corresponding M-H kernel.
The basic idea is that a random number of trials is used, based on the value of α_k(y_k, ε, γ). It is then possible to use more computational effort when the chain is at regions where α_k(y_k, ε, γ) is low. Consider an alternative extended target, for N ≥ 2 and m_k ∈ M_N := {N, N + 1, ...}, 1 ≤ k ≤ n:

π̂^ε(γ, m_{1:n} | y_{1:n}) = [ ξ(γ) p^ε_γ(y_{1:n}) / ∫ ξ(γ) p^ε_γ(y_{1:n}) dγ ] × [ ∏_{k=1}^n (N − 1) / ( (m_k − 1) μ(B_ε(0)) ) ] / p^ε_γ(y_{1:n}) × ∏_{k=1}^n binom(m_k − 1, N − 1) α_k(y_k, ε, γ)^N (1 − α_k(y_k, ε, γ))^{m_k − N}.    (6)

Standard results for negative binomial distributions (see [14, 15] for more details) imply that

∑_{m_k=N}^∞ [ 1/(m_k − 1) ] binom(m_k − 1, N − 1) α_k(y_k, ε, γ)^N (1 − α_k(y_k, ε, γ))^{m_k − N} = α_k(y_k, ε, γ) / (N − 1),

and hence summing over m_{1:n} ∈ M^n_N recovers ∏_{k=1}^n α_k(y_k, ε, γ)/μ(B_ε(0)) = p^ε_γ(y_{1:n}). In addition, from (6) it follows that the marginal w.r.t. γ is the one of interest:

π^ε(γ | y_{1:n}) = ∑_{m_{1:n} ∈ M^n_N} π̂^ε(γ, m_{1:n} | y_{1:n}).

In Algorithm 4 we present an M-H kernel with invariant density π̂^ε. The state of the MCMC chain this time is (γ, m_{1:n}). The potential benefit of this kernel is that one expects the probability of accepting a proposal to be higher than for the previous M-H kernel (for a given N). This comes at a computational cost which is both increased and random. The proposed kernel is based on the N-hit kernel of [10], which has been adapted here to account for the data being a sequence of observations resulting from a time series. Finally, it is important to mention that in Algorithms 3 and 4 generating multiple trials can be implemented very efficiently in parallel using appropriate computing hardware, such as multi-core processors or computing clusters.

Algorithm 4 M-H Proposal with a random number of trials
• Sample γ′ | γ from a proposal Q(γ, ·) with density q(γ, ·).
• For k = 1, .
. . , n repeat the following: sample u_k^1, u_k^2, ... with probability density h_{θ′}(Φ_{θ′}(y_{1:k−1})(x′_0), ·) until there are N samples lying in B_ε(y_k); the number of samples needed to achieve this (including the final successful trial) is m′_k.
• Accept (γ′, m′_{1:n}) with probability:

1 ∧ [ ∏_{k=1}^n 1/(m′_k − 1) / ∏_{k=1}^n 1/(m_k − 1) ] × [ ξ(γ′) q(γ′, γ) / ( ξ(γ) q(γ, γ′) ) ].

To implement the proposed kernel, one needs to select N. In practice it is difficult to know a good value for this a priori, so we present a theoretical result that can add some intuition on choosing N. Let E_{γ,N}[·] denote the expectation w.r.t. ∏_{k=1}^n binom(m_k − 1, N − 1) α_k(y_k, ε, γ)^N (1 − α_k(y_k, ε, γ))^{m_k − N}, given γ, N. We will also pose the assumption:

(A6) For any fixed ε > 0 and γ ∈ Θ × X, we have α_k(y_k, ε, γ) > 0.

Proposition 3.1.
Assume (A6) and let β ∈ (0, 1), n ≥ 1 and N ≥ nβ ∨ 2. Then for fixed (γ, ε) ∈ Θ × X × R_+ we have

E_{γ,N}[ ( ∏_{k=1}^n [1/(M_k − 1)] / ∏_{k=1}^n [α_k(y_k, ε, γ)/(N − 1)] − 1 )² ] ≤ Cn/N,

where C is a constant depending only upon β.

The result shows that one should set N = O(n) for the relative variance not to grow with n, which is unsurprising, given the conditional independence structure of the m_{1:n}. To get a better handle on the variance, suppose n = 1; then, for γ fixed,

Var_{γ,N}[ N^{−1} ∑_{j=1}^N I_{B_ε(y_1)}(u^j_1) ] = α_1(y_1, ε, γ)(1 − α_1(y_1, ε, γ)) / N.
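The estimator in the new kernel replaces this fixed-N proportion by a stopping rule: keep sampling until N pseudo-observations hit the ball, record the number M of trials used, and estimate α by (N − 1)/(M − 1), which is unbiased for a negative binomial waiting time. A sketch of the mechanism, with generic Bernoulli(α) trials standing in for the event that a pseudo-observation hits the ball:

```python
import random

def neg_binomial_trials(alpha, N, rng):
    """Sample M: the number of Bernoulli(alpha) trials needed to obtain
    N successes (the per-observation cost of the new kernel); then
    (N - 1) / (M - 1) is an unbiased estimate of alpha."""
    successes, m = 0, 0
    while successes < N:
        m += 1
        if rng.random() < alpha:
            successes += 1
    return m

rng = random.Random(5)
alpha, N = 0.2, 10
draws = [neg_binomial_trials(alpha, N, rng) for _ in range(20_000)]
est = sum((N - 1) / (m - 1) for m in draws) / len(draws)
```

Note that the expected number of trials per observation is N/α, so the cost automatically adapts to how small α is.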
For the new approach one can show

Var_{γ,N}[ (N − 1)/(M_1 − 1) ] ≤ α_1(y_1, ε, γ)² / (N − 2).

Not taking into account the computational cost, one prefers this new estimate with regards to variance if

N/(N − 2) ≤ (1 − α_1(y_1, ε, γ)) / α_1(y_1, ε, γ),

which is likely to occur if α_1(y_1, ε, γ) is not too large (recall that we want ε to be small, so that we have a good approximation of the true posterior) and N is moderate; this is precisely the scenario in practice.

Remark 3.1.
It is easily shown that the relative variance associated to the estimate ∏_{k=1}^n [ N^{−1} ∑_{j=1}^N I_{B_ε(y_k)}(u^j_k) ] is

∏_{k=1}^n [ α_k(y_k, ε, γ)^{−1} N^{−1} + (N − 1)N^{−1} ] − 1.

Note this quantity is not uniformly upper-bounded in γ unless inf_{k,γ} α_k(y_k, ε, γ) ≥ C > 0, which may not occur. Conversely, Proposition 3.1 shows that the relative variance of the new estimator is uniformly upper-bounded in γ under minimal conditions. We suspect that this means, in practice, that the kernel with a random number of trials may mix faster.

3.2.2 Computational considerations

As the cost per iteration is random, we will investigate this further. We denote the proposal of γ, m_{1:n} as Q̃. Let ζ be the initial distribution of the MCMC chain and ζK^t the distribution of the state at time t. In addition, denote by m^t_k the proposed state for m_k at iteration t. Finally, we will write as E_{ζK^t ⊗ Q̃} the expectation of a random variable proposed by Q̃, given the simulated state at time t. We will assume that the observations are fixed and known. Then we have the following result:

Proposition 3.2.
Let ε > 0, and suppose that there exists a constant C > 0 such that for any n ≥ 1 we have inf_k α_k(y_k, ε, γ) ≥ C, μ − a.e. Then it holds, for any N ≥ 2, t ≥ 1, that:

E_{ζK^t ⊗ Q̃}[ ∑_{k=1}^n M^t_k ] ≤ nN/C.

The expected computational cost grows linearly with n. Thus, coupled with the result in Proposition 3.1, one has a cost of O(n) per iteration, which is comparable to many exact approximations of MCMC algorithms (e.g. [1]), albeit in a much simpler situation. Note also that the kernel in Algorithm 3 is expected to require a cost of O(n) per iteration for reasonable performance, although that cost is deterministic. As mentioned above, one expects the approach with a random number of trials to work better with regards to the mixing time, especially when the values of α_k(y_k, ε, γ) are not large. We attribute this to Algorithm 4 providing a more 'targeted' way to use the simulated auxiliary variables. This will be illustrated numerically in Section 4.

3.2.3 Relating the variance of the estimate of p^ε_γ(y_{1:n}) with the efficiency of ABC-MCMC

A comparison of our results with the interesting work in [7] seems relevant. There, the authors deal with a more general context and show that N should be chosen via a particular asymptotic variance criterion; the main point is that the (asymptotic) variance of the estimate of p^ε_γ(y_{1:n}) should be the same for each γ. We conjecture that in our set-up one should choose N such that the actual variance of the estimate of p^ε_γ(y_{1:n}) is constant with respect to γ. In this scenario, on inspection of the proof of Proposition 3.1, for a given γ, one should set N to be the solution of

( ∏_{k=1}^n α_k(y_k, ε, γ)² ) ( (N − 1)^n (N − 2)^{−n} − 1 ) = C,

for some desired (upper-bound on the) variance C (whose optimal value would need to be obtained). This makes N a random variable, in addition, but does not change the simulation mechanism.
Unfortunately, one cannot do this in practice, as the α_k(y_k, ε, γ) are unknown. Taking into account Remark 3.1, however, this may not be so much of a concern in practice. We conclude this discussion with a related comment regarding the ergodicity of the proposed MCMC kernel. If there exists a constant
C < ∞ such that

1 / ∏_{k=1}^n α_k(y_k, ε, γ) ≤ C, π̂^ε(dγ | y_{1:n}) − a.e.,

and the marginal MCMC kernel in Algorithm 1 is geometrically ergodic, then, by [2, Propositions 7, 9], the MCMC kernel of Algorithm 4 is also geometrically ergodic.

4 Numerical Examples

4.1 Normal means model

For this example, let each of Y_k, X_k, θ be a scalar real random variable and consider the model:

Y_{k+1} = θ X_k + κ_k,
X_{k+1} = X_k,

with X_0 = 1 and κ_k i.i.d. ∼ N(0, σ²), where we denote by N(0, σ²) the zero mean normal distribution with variance σ². The prior on θ is N(0, φ²). This model is usually referred to as the standard normal means model in one dimension and the posterior is given by:

θ | y_{1:n} ∼ N( (σ̄²_n/σ²) ∑_{k=1}^n y_k, σ̄²_n ),

where σ̄²_n = (φ^{−2} + nσ^{−2})^{−1}. Note that if Y_k i.i.d. ∼ N(θ*, σ²), then the posterior on θ is consistent and concentrates around θ* as n → ∞.

The ABC approximation, after marginalising out the auxiliary variables, has a likelihood given by:

p^ε_θ(y_{1:n}) = (1/(2ε)^n) ∏_{k=1}^n [ F( (y_k + ε − θ)/σ ) − F( (y_k − ε − θ)/σ ) ],

where F is the standard normal cumulative distribution function. Thus, this is a scenario where we can perform the marginal MCMC.

Three data sets are generated from the model, for three values of n, with σ = 1. In addition, for ε = 1 we perturb the data-sets in order to use them for noisy ABC. For the sake of comparison, we also generate a noisy ABC data-set for ε = 100. We will also use a prior with φ = 1. We run the new MCMC kernel (the proposal in Algorithm 4; we will frequently use the expression 'Algorithm' to mean an MCMC kernel with the given proposal mechanism of the Algorithm), the old MCMC kernel (Algorithm 3) and a marginal MCMC algorithm which samples directly on the parameter space R (i.e. with posterior density proportional to p^ε_θ(y_{1:n}) ξ(θ)). Each algorithm is run with a normal random walk proposal on the parameter space, with the same scaling.
The scaling chosen yields an acceptance rate of around 0.25 for each run of the marginal MCMC algorithm. The new MCMC kernel is run with N = n and the old with a slightly higher value of N, so that the computational times are about the same (so, for example, the running time of the new kernel is not a problem in this example). The algorithms are run for 10000 iterations and the results can be found in Figures 1-3.

In Figure 1, the density plots for the posterior samples of θ from the marginal MCMC can be seen, for ε ∈ {1, 100} and each value of n. When ε = 1, we can observe that ABC and noisy ABC both get closer to the true posterior as n grows. For noisy ABC, this is the behaviour that is predicted in Section 2.1. For the ABC approximation, following the proof of Theorem 1 in [8], one can see that the bias falls with ε; hence, in this scenario there is not a substantial bias for the standard ABC approximation. When we make ε much larger, a more pronounced difference between ABC and noisy ABC can be seen, and it appears, as n grows, that the noisy ABC approximation is slightly more accurate (relative to ABC).

We now consider the similarity of the new and old MCMC kernels to the marginal algorithm (i.e. the kernel both procedures attempt to approximate); the results are in Figures 2-3. With regards to both the density plots (Figure 2) and auto-correlations (Figure 3), we can see that both MCMC kernels appear to be quite similar to the marginal MCMC. It is also noted that the acceptance rates of these latter kernels are not far from that of the marginal algorithm (results not shown). These results are unsurprising, given the simplicity of the density that we target, but still reassuring; a more comprehensive comparison is given in the next example. Encouragingly, the new and old MCMC kernels do not seem to noticeably worsen as n grows; this shows that, at least for this example, the recommendation of N = O(n) is quite useful.
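For reference, the exact ABC marginal likelihood p^ε_θ(y_{1:n}) targeted by the marginal algorithm in this example can be evaluated directly; a minimal sketch (on the log-scale for numerical stability):

```python
import math

def log_p_eps(theta, ys, eps, sigma=1.0):
    """log p_theta^eps(y_{1:n}) = -n*log(2*eps)
    + sum_k log[F((y_k + eps - theta)/sigma) - F((y_k - eps - theta)/sigma)],
    with F the standard normal cumulative distribution function."""
    F = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    out = -len(ys) * math.log(2.0 * eps)
    for y in ys:
        out += math.log(F((y + eps - theta) / sigma)
                        - F((y - eps - theta) / sigma))
    return out

# As eps -> 0, this approaches the exact Gaussian log-likelihood.
ys = [0.4, -0.2, 1.1]
small_eps = log_p_eps(0.3, ys, eps=1e-4)
exact = sum(-0.5 * math.log(2.0 * math.pi) - 0.5 * (y - 0.3) ** 2 for y in ys)
```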
We remark that whilst these results are for a single batch of data, they are consistent with other data sets.

Set, for $(Y_k, X_k) \in \mathbb{R} \times \mathbb{R}^+$,
$$
Y_{k+1} = \kappa_k, \qquad X_{k+1} = \beta_1 + \beta_2 X_k + \beta_3 Y_{k+1}^2, \qquad k \in \mathbb{N},
$$
where $\kappa_k \mid x_k \stackrel{\mathrm{ind}}{\sim} \mathcal{S}(0, x_k, \phi_1, \phi_2)$ (i.e. a stable distribution with location 0, scale $x_k$ and asymmetry and skewness parameters $\phi_1, \phi_2$). We set
$$
X_1 \sim \mathcal{G}a(a, b), \qquad \beta_1, \beta_2, \beta_3 \sim \mathcal{G}a(c, d),
$$
where $\mathcal{G}a(a,b)$ is a Gamma distribution with mean $a/b$, and $\theta = (\beta_1, \beta_2, \beta_3) \in (\mathbb{R}^+)^3$. This is a GARCH(1,1) model with an intractable likelihood. We consider daily log-returns data from the S&P 500 index from 03/1/11 to 14/02/13, which constitutes 533 data-points. For the priors, we set a = c = 2 and b = d = 1/8, which are not overly informative. In addition, we fix φ₁ = 1.… and φ₂ = 0. We consider ε ∈ {0.5, 0.01} and only a noisy ABC approximation of the model. Algorithms 3 and 4 are to be compared. The MCMC proposals on the parameters are random walks on the log-scale and, for both algorithms, we set N = 250. It should be noted that our results are fairly robust to changes in N ∈ [100, …].

In Figure 4 we can observe the trace plots from a particular (typical) run when ε = 0.5. Algorithm 3 took about 0.30 seconds per iteration and Algorithm 4 about 1.12 seconds per iteration; we modified the proposal variances to yield an acceptance rate of around 0.3. The plot shows that both algorithms appear to move across the state-space in a very reasonable way. The new algorithm takes much longer per iteration and, in this situation, does not appear to be required. This run is one of many we performed, and we observed this behaviour in many of our runs.

In Figure 5 we can observe the trace plots from a particular (typical) run when ε = 0.01. In this case, both algorithms are run for 200000 iterations. Algorithm 3 took about 0.28 seconds per iteration and Algorithm 4 about 2.06 seconds per iteration; this issue is discussed below. In this scenario, considerable effort was expended for Algorithm 3 to yield an acceptance rate of around 0.3, but despite this we were unable to make the algorithm traverse the state-space. In contrast, with less effort, Algorithm 4 appears to perform quite well and to move around the parameter space (the acceptance rate was around 0.15, versus 0.01 for Algorithm 3). Whilst the computational time for Algorithm 4 is considerably greater than that of Algorithm 3, in the same amount of computation time it still moves around the state-space more; runs of the same length are presented for presentational purposes. We remark that, whilst we do not claim that it is 'impossible' to make Algorithm 3 mix well in this example, we were unable to do so, whereas for Algorithm 4 we expended considerably less effort for very reasonable performance. This example is typical of the many runs of the algorithms and examples that we have investigated, and is consistent with the discussion in Section 3.2.2, where we stated that Algorithm 4 is likely to out-perform Algorithm 3 when the α_k(y_k, ε, γ) are not large, which is exactly the scenario in this example.

Turning to the cost of simulating Algorithm 4, as measured by the total number of simulations per iteration: for the case ε = 0.5 this figure was …, whilst for ε = 0.01 it was 330000. In this example, significant effort is expended in simulating the m_{1:n}; this shows, at least in this example, that one can run the algorithm without it failing to sample the m_{1:n}. The results here suggest that one should prefer Algorithm 4 only in challenging scenarios, as it can be very expensive in practice.

Finally, we remark that the MLE for a Gaussian GARCH model is β = (4.… × 10^{−…}, …, …).

Summary

In this article we have considered approximate Bayesian inference for observation driven time series models. We studied consistency properties of the corresponding MAP estimators and also proposed an efficient ABC-MCMC algorithm to sample from these approximate posteriors. The performance of the latter was illustrated using numerical examples.

There are several interesting extensions to this work:
• The asymptotic analysis of the ABC posterior in Section 2.2 can be extended further. For example, one may consider Bayesian consistency or Bernstein-von Mises theorems, which could provide further justification for the approximation introduced here. Alternatively, one could study the asymptotic bias of the ABC posterior w.r.t. ε, or the asymptotic loss in efficiency of the noisy ABC posterior w.r.t. ε, similarly to the work in [5] for hidden Markov models.
• The geometric ergodicity of the presented MCMC sampler can be investigated further, in the spirit of [2, 11].
• An investigation extending the ideas here to sequential Monte Carlo methods should be beneficial; this has been initiated in [9] in the context of particle filtering for a different class of models.

Acknowledgements

A. Jasra acknowledges support from the MOE Singapore and funding from Imperial College London. N. Kantas was kindly funded by EPSRC under grant EP/J01365X/1.
A Proofs for Section 2
Proof. [Proof of Proposition 2.1] The convergence $\lim_{n\to\infty} d(\theta_{n,x,\epsilon}, \Theta^*_\epsilon) = 0$, $\mathbb{P}_{\theta^*}$-a.s., follows from [6, Theorem 21] if we can establish conditions (B1-3) for our perturbed ABC model. Clearly (B1) and part of (B2) hold, and (B3-i) holds by [6, Lemma 21], using (A4). We first need to show that, for any $y \in \mathsf{Y}$, $x \mapsto h_\epsilon(x, y)$ is continuous. Consider
$$
|h_\epsilon(x,y) - h_\epsilon(x',y)| = \frac{1}{\mu(B_\epsilon(0))}\Big|\int_{B_\epsilon(y)} \big(h(x,u) - h(x',u)\big)\,\mu(du)\Big|.
$$
Let $\varepsilon > 0$; then, by (A3), there exists a $\delta > 0$ such that $d(x,x') < \delta$ implies $\sup_{y\in\mathsf{Y}} |h(x,y) - h(x',y)| < \varepsilon$, and hence, for $(x,x')$ as above,
$$
|h_\epsilon(x,y) - h_\epsilon(x',y)| < \varepsilon,
$$
which establishes (B2) of [6]. Now, for (B3-ii) of [6], we note that as $h^- \le h_\epsilon(x,y) \le h^+ < \infty$ (see (A3)), the log function is Lipschitz on $[h^-, h^+]$ and
$$
\big|\log h_\epsilon(\Phi_\theta(Y_{1:k-1})(x), Y_k) - \log h_\epsilon(\Phi_\theta(Y_{-\infty:k-1})(x), Y_k)\big| \le C\,\big|h_\epsilon(\Phi_\theta(Y_{1:k-1})(x), Y_k) - h_\epsilon(\Phi_\theta(Y_{-\infty:k-1})(x), Y_k)\big|
$$
for some $C < \infty$ that does not depend upon $Y_{-\infty:k-1}$, $Y_k$, $x$, $\epsilon$. Now
$$
\big|h_\epsilon(\Phi_\theta(Y_{1:k-1})(x), Y_k) - h_\epsilon(\Phi_\theta(Y_{-\infty:k-1})(x), Y_k)\big| = \mu(B_\epsilon(0))^{-1}\Big|\int_{B_\epsilon(Y_k)} \big[h(\Phi_\theta(Y_{1:k-1})(x), y) - h(\Phi_\theta(Y_{-\infty:k-1})(x), y)\big]\,\mu(dy)\Big|
$$
and
$$
\Big|\int_{B_\epsilon(Y_k)} \big(h(\Phi_\theta(Y_{1:k-1})(x), y) - h(\Phi_\theta(Y_{-\infty:k-1})(x), y)\big)\,\mu(dy)\Big| \le \mu(B_\epsilon(0)) \sup_{y\in\mathsf{Y}} \big|h(\Phi_\theta(Y_{1:k-1})(x), y) - h(\Phi_\theta(Y_{-\infty:k-1})(x), y)\big|.
$$
Thus, by (A4) and the fact that (B3-i) of [6] holds,
$$
\lim_{k\to\infty} \sup_{\theta\in\Theta} \big|\log h_\epsilon(\Phi_\theta(Y_{1:k-1})(x), Y_k) - \log h_\epsilon(\Phi_\theta(Y_{-\infty:k-1})(x), Y_k)\big| = 0 \quad \mathbb{P}_{\theta^*}\text{-a.s.}
$$
Note, finally, that (B3-iii) follows trivially from $h_\epsilon(x,y) \le h^+ < \infty$. Hence we have proved that $\lim_{n\to\infty} d(\theta_{n,x,\epsilon}, \Theta^*_\epsilon) = 0$, $\mathbb{P}_{\theta^*}$-a.s.

Proof. [Proof of Proposition 2.2] This result follows from [6, Proposition 23]. One can establish assumptions (B1-3) of [6] using the proof of Proposition 2.1; thus we need only prove that $H_\epsilon(x,\cdot) = H_\epsilon(x',\cdot) \Leftrightarrow x = x'$. Now, for any $A \in \mathcal{Y}$,
$$
H_\epsilon(x, A) = \frac{1}{\mu(B_\epsilon(0))} \int_A \Big[\int_{B_\epsilon(y)} H(x, du)\Big]\,\mu(dy).
$$
By (A5), $\int_{B_\epsilon(y)} H(x, du) = \int_{B_\epsilon(y)} H(x', du) \Leftrightarrow x = x'$, so
$$
H_\epsilon(x, A) = \frac{1}{\mu(B_\epsilon(0))} \int_A \Big[\int_{B_\epsilon(y)} H(x', du)\Big]\,\mu(dy) \quad \Leftrightarrow \quad x = x',
$$
which completes the proof.

B Proofs for Section 3
Proof. [Proof of Proposition 3.1] We have
$$
\mathbb{E}_{\gamma,N}\Bigg[\bigg(\frac{\prod_{k=1}^n (M_k-1)^{-1}}{\prod_{k=1}^n \alpha_k(y_k,\epsilon,\gamma)/(N-1)} - 1\bigg)^2\Bigg]
= \frac{1}{\big(\prod_{k=1}^n \alpha_k(y_k,\epsilon,\gamma)/(N-1)\big)^2}\bigg(\prod_{k=1}^n \mathbb{E}_{\gamma,N}\big[(M_k-1)^{-2}\big] - \prod_{k=1}^n \big(\alpha_k(y_k,\epsilon,\gamma)/(N-1)\big)^2\bigg).
$$
Now, by [14, 15] (for $N \ge 3$), for any $k \ge 1$,
$$
\mathbb{E}_{\gamma,N}\Big[\frac{1}{(M_k-1)(M_k-2)}\Big] = \frac{\alpha_k(y_k,\epsilon,\gamma)^2}{(N-1)(N-2)}
\qquad\text{and so}\qquad
\mathbb{E}_{\gamma,N}\big[(M_k-1)^{-2}\big] \le \frac{\alpha_k(y_k,\epsilon,\gamma)^2}{(N-1)(N-2)},
$$
since $(M_k-1)^2 \ge (M_k-1)(M_k-2)$. Hence
$$
\mathbb{E}_{\gamma,N}\Bigg[\bigg(\frac{\prod_{k=1}^n (M_k-1)^{-1}}{\prod_{k=1}^n \alpha_k(y_k,\epsilon,\gamma)/(N-1)} - 1\bigg)^2\Bigg] \le \frac{(N-1)^n - (N-2)^n}{(N-2)^n}. \tag{7}
$$
Now the R.H.S. of (7) is equal to
$$
\frac{nN^{n-1} + \sum_{i=2}^n \binom{n}{i} N^{n-i}\big[(-1)^i - (-2)^i\big]}{N^n - 2nN^{n-1} + \sum_{i=2}^n \binom{n}{i} N^{n-i}(-2)^i}. \tag{8}
$$
Now, we will show that
$$
\sum_{i=2}^n \binom{n}{i} N^{n-i}\big[(-1)^i - (-2)^i\big] \le 0. \tag{9}
$$
The proof is given when $n$ is odd; the case of even $n$ follows by the same argument, as the single unpaired term is itself non-positive. For $k \in \{1, 2, \dots, (n-1)/2\}$, the sum of the consecutive even and odd terms $i = 2k$ and $i = 2k+1$ is equal to
$$
\binom{n}{2k+1} N^{n-2k-1}\Big[\frac{N(1-2^{2k})(2k+1)}{n-2k} + \big(2^{2k+1}-1\big)\Big],
$$
which is non-positive as
$$
N \ge \frac{(2^{2k+1}-1)(n-2k)}{(2^{2k}-1)(2k+1)},
$$
which holds under our assumption on $N$. Thus we have established (9). We will now show that
$$
\sum_{i=2}^n \binom{n}{i} N^{n-i}(-2)^i \ge 0. \tag{10}
$$
Following the same approach as above (i.e. with $n$ odd), the sum of the consecutive terms $i = 2k$ and $i = 2k+1$ is equal to
$$
\binom{n}{2k+1} N^{n-2k-1}\, 2^{2k}\Big[\frac{N(2k+1)}{n-2k} - 2\Big].
$$
This is non-negative if $N \ge 2(n-2k)/(2k+1)$; as $N \ge 2n/(1-\beta) \ge 2n \ge 2(n-2k)/(2k+1)$, one can establish (10). Now, returning to (7) and noting (8), (9) and (10), we have
$$
\mathbb{E}_{\gamma,N}\Bigg[\bigg(\frac{\prod_{k=1}^n (M_k-1)^{-1}}{\prod_{k=1}^n \alpha_k(y_k,\epsilon,\gamma)/(N-1)} - 1\bigg)^2\Bigg] \le \frac{nN^{n-1}}{N^n - 2nN^{n-1}} = \frac{n}{N-2n};
$$
as $N \ge 2n/(1-\beta)$, it follows that $n/(N-2n) \le n/(\beta N) \le Cn/N$, and we conclude.
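The inverse moments of the negative-binomial distribution invoked from [14, 15] above, and the plain expectation used in the proof of Proposition 3.2 below, are easy to check by simulation. The following sketch (our illustration, not part of the paper) draws M as the total number of Bernoulli(α) trials needed for N successes:

```python
import numpy as np

rng = np.random.default_rng(1)

alpha, N, reps = 0.3, 10, 200_000
# Trials needed for N successes: N successes plus a
# NegativeBinomial(N, alpha) number of failures.
M = N + rng.negative_binomial(N, alpha, size=reps)

est_mean = M.mean()                            # ~ N / alpha
est_inv1 = np.mean(1.0 / (M - 1))              # ~ alpha / (N - 1)
est_inv2 = np.mean(1.0 / ((M - 1) * (M - 2)))  # ~ alpha**2 / ((N - 1)(N - 2))
```

The first moment, N/α per observation, is what gives the O(n) expected cost in Proposition 3.2; the two inverse moments are the unbiasedness and second-moment properties exploited in the bound (7).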
Proof. [Proof of Proposition 3.2] We have
$$
\mathbb{E}_{\zeta K^i \otimes \tilde Q}\Big[\sum_{k=1}^n M_k\Big] = \int_{(\Theta\times\mathsf{X})^2} \sum_{m_{1:n}} \Big(\sum_{k=1}^n m_k\Big) \Big\{\prod_{k=1}^n \binom{m_k-1}{N-1}\,\alpha_k(y_k,\gamma',\epsilon)^N\,\big(1-\alpha_k(y_k,\gamma',\epsilon)\big)^{m_k-N}\Big\}\,q(\gamma,\gamma')\,\zeta K^i(d\gamma)\,d\gamma'
$$
$$
= \int_{(\Theta\times\mathsf{X})^2} \Big(\sum_{k=1}^n \frac{N}{\alpha_k(y_k,\gamma',\epsilon)}\Big)\,q(\gamma,\gamma')\,\zeta K^i(d\gamma)\,d\gamma' \le \frac{nN}{C},
$$
where the inner sum is over $m_{1:n} \in \{N, N+1, \dots\}^n$; we have used the expectation of a negative-binomial random variable and applied $\inf_k \alpha_k(y_k,\gamma,\epsilon) \ge C$, $\mu$-a.e., in the inequality.

References
[1]
Andrieu, C., Doucet, A. & Holenstein, R. (2010). Particle Markov chain Monte Carlo methods (with discussion). J. R. Statist. Soc. Ser. B, 72, 269–342.
[2] Andrieu, C. & Vihola, M. (2012). Convergence properties of pseudo-marginal Markov chain Monte Carlo algorithms. arXiv:1210.1484 [math.PR].
[3] Beaumont, M. A. (2003). Estimation of population growth or decline in genetically monitored populations. Genetics, 164, 1139–1160.
[4] Cox, D. R. (1981). Statistical analysis of time series: some recent developments. Scand. J. Statist., 8, 93–115.
[5] Dean, T. A., Singh, S. S., Jasra, A. & Peters, G. W. (2010). Parameter estimation for hidden Markov models with intractable likelihoods. arXiv:1103.5399 [math.ST].
[6] Douc, R., Doukhan, P. & Moulines, E. (2012). Ergodicity of observation-driven time series models and consistency of the maximum likelihood estimator. arXiv:1210.4739 [math.ST].
[7] Doucet, A., Pitt, M. & Kohn, R. (2012). Efficient implementation of Markov chain Monte Carlo when using an unbiased likelihood estimator. arXiv:1210.1871 [stat.ME].
[8] Jasra, A., Singh, S. S., Martin, J. S. & McCoy, E. (2012). Filtering via approximate Bayesian computation. Statist. Comp., 22, 1223–1237.
[9] Jasra, A., Lee, A., Yau, C. & Zhang, X. (2013). The alive particle filter. In preparation.
[10] Lee, A. (2012). On the choice of MCMC kernels for approximate Bayesian computation with SMC samplers. In Proc. Winter Sim. Conf.
[11] Lee, A. & Latuszynski, K. (2012). Variance bounding and geometric ergodicity of Markov chain Monte Carlo for approximate Bayesian computation. arXiv:1210.6703 [stat.ME].
[12] Marjoram, P., Molitor, J., Plagnol, V. & Tavaré, S. (2003). Markov chain Monte Carlo without likelihoods. Proc. Nat. Acad. Sci., 100, 15324–15328.
[13] Marin, J.-M., Pudlo, P., Robert, C. P. & Ryder, R. (2012). Approximate Bayesian computational methods. Statist. Comp., 22, 1167–1180.
[14] Neuts, M. F. & Zacks, S. (1967). On mixtures of χ² and F distributions which yield distributions of the same family. Ann. Inst. Stat. Math., 19, 527–536.
[15] Zacks, S. (1980). On some inverse moments of negative-binomial distributions and their application in estimation. J. Stat. Comp. & Sim., 10, 163–165.

Figure 1: Marginal MCMC density plots. In each plot the true posterior (black), noisy ABC (orange) and ABC (green) densities of θ are plotted, for different values of n (10, 1st row; 100, 2nd row; 1000, 3rd row) and ε (1, 1st column; 100, 2nd column). The black vertical line is the value of θ that generated the data.

Figure 2: MCMC density plots. In each plot the true posterior (black), ABC (green, 1st column) or noisy ABC (orange, 2nd column) densities of θ are plotted, for different values of n (10, 1st row; 100, 2nd row; 1000, 3rd row). In addition, the plots are for the new and old (the *) MCMC kernels. The black vertical line is the value of θ that generated the data. Throughout, ε = 1.

Figure 3: Marginal MCMC and standard auto-correlation plots. In each plot in column 1 the auto-correlation at every 10th iteration is plotted for noisy ABC (orange) and ABC (green). In each plot in column 2 the auto-correlation over the same period is plotted for both the new and old MCMC kernels, for both noisy ABC (orange for new, blue for old) and ABC (green for new, red for old). Three different values of n are presented and ε = 1 throughout.

Figure 4: Trace plots, ε = 0.5. Both algorithms are run for 50000 iterations on the S&P 500 data, with N = 250 in both cases; the new kernel is the orange trace. Panels: (a) X_1, (b) β_1, (c) β_2, (d) β_3.

Figure 5: Trace plots, ε = 0.01. Panels: (a) X_1, (b) β_1, (c) β_2, (d) β_3.
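For completeness, simulating from the stable-volatility GARCH(1,1) model of the second example is straightforward even though its likelihood is intractable. The sketch below is our illustration (not code from the paper): it assumes the standard squared-return GARCH recursion, uses the Chambers-Mallows-Stuck representation for the symmetric stable case φ₂ = 0 employed in the example, and the parameter values shown are hypothetical.

```python
import numpy as np

def sym_stable(alpha, scale, rng, size=None):
    """Symmetric alpha-stable draw(s), location 0 and skewness 0,
    via the Chambers-Mallows-Stuck representation."""
    u = rng.uniform(-np.pi / 2, np.pi / 2, size=size)
    w = rng.exponential(1.0, size=size)
    z = (np.sin(alpha * u) / np.cos(u) ** (1.0 / alpha)
         * (np.cos((1.0 - alpha) * u) / w) ** ((1.0 - alpha) / alpha))
    return scale * z

def simulate_stable_garch(beta, x1, phi1, n, seed=0):
    """Simulate n observations: Y_{k+1} ~ S(0, X_k, phi1, 0) and
    X_{k+1} = beta[0] + beta[1] * X_k + beta[2] * Y_{k+1}**2."""
    rng = np.random.default_rng(seed)
    y, x = np.empty(n), np.empty(n + 1)
    x[0] = x1
    for k in range(n):
        y[k] = sym_stable(phi1, x[k], rng)
        x[k + 1] = beta[0] + beta[1] * x[k] + beta[2] * y[k] ** 2
    return y, x

# Illustrative (not the paper's) parameter values:
y, x = simulate_stable_garch(beta=(0.1, 0.2, 0.1), x1=1.0, phi1=1.5, n=100)
```

Repeated calls to such a simulator are what the ABC-MCMC algorithms require in place of point-wise likelihood evaluations.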