Sequential Monte Carlo algorithms for agent-based models of disease transmission
Nianqiao Ju*, Jeremy Heng, and Pierre E. Jacob
Department of Statistics, Harvard University, USA
ESSEC Business School, Singapore
Abstract
Agent-based models of disease transmission involve stochastic rules that specify how a number of individuals would infect one another, recover or be removed from the population. Common yet stringent assumptions stipulate interchangeability of agents and that all pairwise contacts are equally likely. Under these assumptions, the population can be summarized by counting the number of susceptible and infected individuals, which greatly facilitates statistical inference. We consider the task of inference without such simplifying assumptions, in which case the population cannot be summarized by low-dimensional counts. We design improved particle filters, where each particle corresponds to a specific configuration of the population of agents, that take either the next or all future observations into account when proposing population configurations. Using simulated data sets, we illustrate that orders of magnitude improvements are possible over bootstrap particle filters. We also provide theoretical support for the approximations employed to make the algorithms practical.
Agent-based models, also called individual-based models, are used in many fields, such as social sciences [Epstein, 2006], demographics [Hooten et al., 2020], ecology [DeAngelis and Gross, 2018] and macroeconomics [Turrell, 2016]. These models describe the time evolution of a population according to a Markov process. The population comprises N ∈ N agents who interact with one another through a probabilistic mechanism. The popularity of these models could be attributed to the ease of model building and interpretation, while allowing the representation of complex phenomena. Specialized software is available to facilitate their simulation and the visualization of their output [e.g. Tisue and Wilensky, 2004].

The primary use of these models seems to involve "forward simulations" under various parameters corresponding to hypothetical scenarios. In comparison, relatively few works have tackled the question of statistical inference for such models, i.e. the estimation of model parameters given available observations, also known as model calibration. Broad articles on the question of fitting agent-based models include Grazzini et al. [2017], Hooten et al. [2020], Hazelbag et al. [2020], and this is a very active research area. The question of estimation is well-posed: by viewing agent-based models as a subclass of hidden Markov models, we can define the associated likelihood function, from which maximum likelihood and Bayesian procedures can be envisioned.

Gibbs sampling or "data augmentation" Markov chain Monte Carlo (MCMC) methods, which alternate between parameter and latent agent state updates, are generically applicable; see Fintzi et al. [2017], Bu et al. [2020] in the context of disease transmission. However, the mixing properties of these chains can be problematic, as we illustrate in Appendix C.

* Corresponding author: [email protected]
Likelihood-based inference that avoids the use of data augmentation requires one to marginalize out the latent population process. This is computationally challenging as the number of possible configurations of the population of agents grows exponentially in N. Our contribution is to design improved particle filters to estimate the likelihood function of agent-based models, which can then be used for parameter inference procedures such as simulated maximum likelihood and particle MCMC [Andrieu et al., 2010]. Particle MCMC methods, based on more standard particle filters, have been used for agent-based models before, see e.g. Kattwinkel and Reichert [2017]. For indirect inference and approximate Bayesian computation applied to agent-based models we refer to [Platt, 2020, van der Vaart et al., 2016, Chen et al., 2017, Sirén et al., 2018].

We limit the scope of this article to some agent-based models of disease transmission. We next introduce a specific model which will be used to draw clear connections between agent-based models and more tractable but restrictive formulations of disease outbreak models. The state of agent n at time t, denoted by X_t^n, takes the value 0 or 1, corresponding to a "susceptible" or "infected" status in the context of disease transmission. We consider a closed population of size N, T ∈ N time steps and a stepsize of h > 0. Initial agent states follow independent Bernoulli distributions, i.e.

    X_0^n ∼ Ber(α_0),    (1)

for n ∈ [1:N] = {1, ..., N}, where we assume for the time being that each agent has the same probability α_0 ∈ (0,1) of being infected. The state of the population X_t = (X_t^n)_{n ∈ [1:N]} ∈ {0,1}^N evolves according to a Markov process that depends on the interactions between the agents. The Markov transition specifies that, given the previous states x_{t−h} ∈ {0,1}^N, the next agent states (X_t^n)_{n ∈ [1:N]} are conditionally independent Bernoulli variables with probabilities (α^n(x_{t−h}))_{n ∈ [1:N]} given by

    α^n(x_{t−h}) = { hλ N^{−1} Σ_{m=1}^N x_{t−h}^m,    if x_{t−h}^n = 0,
                   { 1 − hγ,                           if x_{t−h}^n = 1.    (2)

Here λ ∈ (0,1) and γ ∈ (0,1) represent infection and recovery rate parameters respectively. An uninfected agent becomes infected with probability proportional to λ and to the proportion of infected individuals. An infected agent recovers with probability proportional to γ. The ratio R_0 = λ/γ is known as the basic reproductive number, which characterizes the transmission potential of a disease [Lipsitch et al., 2003, Britton, 2010]. The latent population process (X_t) can be related to aggregated observations (Y_t) of the entire population. For example, the number of infections reported at each time could be modelled as a Binomial distribution, i.e.

    Y_t | X_t = x_t ∼ Bin(I(x_t), ρ),    (3)

where I(x_t) = Σ_{n=1}^N x_t^n ∈ [0:N] counts the number of infected individuals and ρ ∈ (0,1) is a reporting probability. Since the latent space {0,1}^N contains 2^N configurations, evaluating the likelihood function with the forward algorithm would be impractical. Statistical inference is still feasible as the latent process admits low-dimensional summaries. Since (2) assumes that all agents are interchangeable and all pairwise contacts are equally likely, the dynamics depend only on the number of infected individuals I_t = I(X_t) and susceptible individuals S_t = N − I_t. More precisely, the Markov transition for each t > 0 is

    S_t = S_{t−h} + N_t^S − N_t^I,    I_t = I_{t−h} − N_t^S + N_t^I,    (4)

where N_t^S ∼ Bin(I_{t−h}, hγ) and N_t^I ∼ Bin(S_{t−h}, hλ N^{−1} I_{t−h}) denote the number of new recoveries and infections, respectively. The initialization in (1) corresponds to having I_0 ∼ Bin(N, α_0) and S_0 = N − I_0. The process (S_t, I_t) is a Markov chain on the lower-dimensional state space {(n, m) ∈ [0:N]^2 : n + m = N}, and not on {0,1}^N. When combined with (3), the resulting hidden Markov model can be estimated using particle filtering strategies; see e.g. Ionides et al. [2006], Endo et al. [2019], Whiteley and Rimella [2020].

For large population sizes N, one can exploit the asymptotic properties of the dynamics (4) to further simplify inference procedures. As N → ∞, the finite population proportions S̄_t^N = S_t/N and Ī_t^N = I_t/N admit deterministic limits S̄_t and Ī_t, defined by the recursion

    S̄_t = S̄_{t−h} + hγ Ī_{t−h} − hλ S̄_{t−h} Ī_{t−h},    Ī_t = Ī_{t−h} − hγ Ī_{t−h} + hλ S̄_{t−h} Ī_{t−h},    (5)

with initial condition (S̄_0, Ī_0) = (1 − α_0, α_0). If fine time resolutions are desired, one can also consider the limit h → 0, in which case (S̄_t, Ī_t) converges to a deterministic continuous-time process (S̄(t), Ī(t)), defined by the following system of ordinary differential equations

    dS̄(t)/dt = γ Ī(t) − λ S̄(t) Ī(t),    dĪ(t)/dt = −γ Ī(t) + λ S̄(t) Ī(t),    (6)

with (S̄(0), Ī(0)) = (1 − α_0, α_0). We refer to Allen and Burgin [2000] and Allen et al. [2008, Chapter 3] for formal links between stochastic and deterministic SIS models, in the form of limit theorems as N → ∞. The dynamics in (5) or (6) combined with an observation model such as (3) leads again to a fairly tractable statistical model, which can be estimated by maximum likelihood or in the Bayesian framework [e.g. Osthus et al., 2019].

The tractability of (4) and its limits (5)-(6) hinges critically on the assumption that all agents are interchangeable and equally likely to be in contact. These assumptions are strong, as it would be desirable to let the infection and recovery rates depend on individual characteristics or features, and the structure of interactions be specified by an undirected network. Adopting any one of these generalizations would result in an agent-based model that cannot be summarized by low-dimensional counts. This article is concerned with the task of inference when these assumptions are relaxed. Our approach is to return to the hidden Markov model formulation with latent space {0,1}^N, and seek novel sampling algorithms that have polynomial rather than exponential costs in N. The guiding principle of our contribution is that we can improve the performance of Monte Carlo algorithms, sometimes dramatically, by tailoring them to the models at hand, and by using available observations.

The rest of this article is organized as follows. In Section 2, we consider a "static" case with a single time step to illustrate some of the challenges and proposed solutions in a simple setting.
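To fix ideas, the count dynamics (4) are straightforward to simulate forward. The following stdlib-only Python sketch does so; the parameter values are arbitrary illustrative choices, not taken from any experiment in this paper:

```python
import random

def binomial(n, p):
    # Draw Bin(n, p) as a sum of Bernoulli trials (slow but dependency-free).
    return sum(random.random() < p for _ in range(n))

def simulate_sis_counts(N, T, h, lam, gamma, alpha0, seed=1):
    """Simulate (S_t, I_t) following (4): new recoveries N_t^S ~ Bin(I_{t-h}, h*gamma),
    new infections N_t^I ~ Bin(S_{t-h}, h*lam*I_{t-h}/N)."""
    random.seed(seed)
    I = binomial(N, alpha0)  # initialization (1): I_0 ~ Bin(N, alpha_0)
    S = N - I
    path = [(S, I)]
    for _ in range(T):
        new_rec = binomial(I, h * gamma)
        new_inf = binomial(S, h * lam * I / N)
        S, I = S + new_rec - new_inf, I - new_rec + new_inf
        path.append((S, I))
    return path

path = simulate_sis_counts(N=1000, T=50, h=0.1, lam=0.8, gamma=0.3, alpha0=0.05)
```

By construction S_t + I_t = N at every step, which mirrors the closed-population constraint underlying (4).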
We describe how the cost of likelihood evaluation and of sampling the posterior of agent states given observations can be reduced from exponential to polynomial in N. In Section 3, we consider the dynamic setting with multiple time steps and design auxiliary particle filters and controlled SMC algorithms tailored to heterogeneous SIS agent-based models. We also provide theoretical support for the approximations employed to make the algorithms practical. We extend our methodology to susceptible-infected-recovered (SIR) models in Section 4 and highlight the associated difficulties. Section 5 summarizes our findings and open questions. This manuscript represents preliminary work and further numerical experiments will be added in a later version. The code is available at https://github.com/nianqiaoju/agents . The proofs and more details are in appendices.

For simplicity we begin with the static model X^n ∼ Ber(α^n), independently for n ∈ [1:N], and allow each agent to be unique by modeling α^n = (1 + exp(−β^⊤ w^n))^{−1}, where β ∈ R^d are parameters and w^n ∈ R^d are the covariates of agent n. This allows individual factors to account for the probability of infection α = (α^n)_{n ∈ [1:N]}. Given an observation y ∈ [0:N] modelled as in (3), our inferential objectives might include estimating the unknown parameters θ = (β, ρ) ∈ Θ and/or the latent states X = (X^n)_{n ∈ [1:N]} ∈ {0,1}^N. The complete data likelihood is p_θ(x, y) = f_θ(x) g_θ(y|x), where

    f_θ(x) = Π_{n=1}^N Ber(x^n; α^n),    g_θ(y|x) = Bin(y; I(x), ρ) 1(I(x) ≥ y).    (7)

Here Ber(x^n; α^n) and Bin(y; I(x), ρ) denote the corresponding probability mass functions (PMFs). Marginalizing over the latent states yields the likelihood p_θ(y). The agent states given the observation and a parameter follow the posterior distribution p_θ(x|y).
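For concreteness, here is a minimal Python sketch of the complete data likelihood (7), in log scale for numerical stability. The covariates, parameter values and configuration below are hypothetical, and we take scalar covariates (d = 1) for simplicity:

```python
import math
import random

def log_binom_pmf(y, n, p):
    # log Bin(y; n, p); -inf outside the support [0, n].
    if y < 0 or y > n:
        return float("-inf")
    return (math.lgamma(n + 1) - math.lgamma(y + 1) - math.lgamma(n - y + 1)
            + y * math.log(p) + (n - y) * math.log(1 - p))

def alphas(beta, w):
    # alpha^n = logistic(beta * w^n), with scalar covariates for simplicity.
    return [1.0 / (1.0 + math.exp(-beta * wn)) for wn in w]

def log_complete_likelihood(x, y, beta, rho, w):
    # log p_theta(x, y) = log f_theta(x) + log g_theta(y | x) as in (7).
    a = alphas(beta, w)
    log_f = sum(math.log(an if xn == 1 else 1 - an) for xn, an in zip(x, a))
    I = sum(x)  # I(x): number of infected agents
    return log_f + log_binom_pmf(y, I, rho)

random.seed(0)
w = [random.gauss(4, 1) for _ in range(10)]  # hypothetical covariates
x = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]           # a candidate configuration, I(x) = 5
ll = log_complete_likelihood(x, y=3, beta=0.3, rho=0.8, w=w)
```

Note that the likelihood vanishes whenever y > I(x), which is the indicator in (7).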
In Sections 2.1 and 2.2, we examine the cost of exactly computing p_θ(y) and of sampling from p_θ(x|y), respectively, and describe cheaper approximations. The gains are assessed numerically in Section 2.3.

A naive approach to compute the marginal likelihood p_θ(y) is to sum over all possible population configurations:

    p_θ(y) = Σ_{x ∈ {0,1}^N} p_θ(x, y) = Σ_{x ∈ {0,1}^N} f_θ(x) g_θ(y|x).    (8)

This requires O(2^N) operations. A simple Monte Carlo approach involves sampling X^{(p)} = (X^{(p),n})_{n ∈ [1:N]} ∼ f_θ independently for p ∈ [1:P], which represents P possible configurations of the population, and returning the Monte Carlo estimator P^{−1} Σ_{p=1}^P g_θ(y|X^{(p)}) that weights each configuration according to the observation density. The estimator is unbiased and only costs O(NP) to compute, but its variance might be prohibitively large for practical values of P, depending on the observation model g_θ(y|x). Another issue with this estimator is that it can collapse to zero whenever I(X^{(p)}) < y for all p ∈ [1:P]. Following Del Moral et al. [2015], this can be circumvented by repeatedly drawing samples X^{(p)} ∼ f_θ independently until there are P configurations that satisfy I(X^{(p)}) ≥ y, and returning the estimator (R − 1)^{−1} Σ_{p=1}^{R−1} g_θ(y|X^{(p)}), where R ≥ P denotes the number of repetitions needed. The resulting estimator can be shown to be unbiased and has a random cost of O(N E[R]) that depends on the value of y. An obvious shortcoming of the above estimators is that the agents are sampled from f_θ without using the available observation y.

We can in fact reduce the cost of computing (8) without resorting to Monte Carlo approximations. The starting observation is that I = I(X) under X ∼ f_θ is the sum of N independent Bernoulli random variables with non-identical success probabilities, and follows a distribution called "Poisson Binomial".
We will refer to this distribution as PoiBin(α) and write its PMF as

    PoiBin(i; α) = Σ_{x ∈ {0,1}^N} 1(Σ_{n=1}^N x^n = i) Π_{n=1}^N (α^n)^{x^n} (1 − α^n)^{1−x^n},    i ∈ [0:N].    (9)

Exact evaluation of (9) has been considered in Barlow and Heidtmann [1984], Chen et al. [1994], Chen and Liu [1997], Hong [2013] using different approaches. Defining q(i, n) = PoiBin(i; α^{n:N}) for i ∈ [0:N] and n ∈ [1:N], we will employ the following recursion

    q(i, n) = α^n q(i − 1, n + 1) + (1 − α^n) q(i, n + 1),    i ∈ [1:N], n ∈ [1:N−1],    (10)

with initial conditions q(0, n) = Π_{m=n}^N (1 − α^m) for n ∈ [1:N], q(1, N) = α^N and q(i, N) = 0 for i ∈ [2:N]. The desired PMF q(i, 1) = PoiBin(i; α) for i ∈ [0:N] can thus be computed in O(N^2) operations; see Appendix A.1 for a derivation of (10).

Using the above observation, we can rewrite the marginal likelihood as

    p_θ(y) = Σ_{i=0}^N PoiBin(i; α) Bin(y; i, ρ) 1(i ≥ y),    (11)

which costs O(N^2) to compute. Using a thinning argument detailed in Appendix A.2, the above sum is in fact equal to p_θ(y) = PoiBin(y; ρα). Although the marginal likelihood will not admit such tractability in the general setup considered in Section 3, the preceding observations will guide our choice of approximations.

One can also rely on approximations of the Poisson Binomial distribution to further reduce the cost of computing (11) to O(N). Choices include the Poisson approximation [Hodges and Le Cam, 1960, Barbour and Hall, 1984, Wang, 1993], the Normal approximation [Volkova, 1996] and the translated Poisson approximation [Barbour and Xia, 1999, Cekanavicius and Vaitkus, 2001, Barbour and Čekanavičius, 2002]. We will focus on the translated Poisson approximation, which exactly matches the mean and approximately matches the variance of a Poisson Binomial distribution. Let Poi(λ) denote a Poisson distribution with rate λ > 0, and define the mean and variance of PoiBin(α) as μ = Σ_{n=1}^N α^n and σ^2 = Σ_{n=1}^N α^n (1 − α^n), respectively. The translated Poisson approximation of (9) is given by

    TransPoi(i; μ, σ^2) = { 0,                                       i ∈ [0 : ⌊μ − σ^2⌋ − 1],
                          { Poi(i − ⌊μ − σ^2⌋; σ^2 + {μ − σ^2}),     i ∈ [⌊μ − σ^2⌋ : N],    (12)

where ⌊μ − σ^2⌋ and {μ − σ^2} denote the floor and fractional part of μ − σ^2, respectively. Since μ and σ^2 can be computed in O(N) operations, the translated Poisson approximation (12) and the resulting approximation of (11) only require O(N) operations. Hence this can be appealing in the setting of large population sizes N, at the expense of an approximation error that is well-studied.
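As an illustration, the translated Poisson PMF (12) can be written out directly; the vector α below is an arbitrary example:

```python
import math

def transpoi_pmf(i, alpha):
    # TransPoi(i; mu, sigma^2) as in (12): zero below floor(mu - sigma^2),
    # otherwise a Poisson with rate sigma^2 + frac(mu - sigma^2),
    # shifted by floor(mu - sigma^2) so that the mean mu is matched exactly.
    mu = sum(alpha)
    sig2 = sum(a * (1 - a) for a in alpha)
    shift = math.floor(mu - sig2)
    if i < shift:
        return 0.0
    rate = sig2 + (mu - sig2 - shift)  # sigma^2 + fractional part of mu - sigma^2
    k = i - shift
    return math.exp(-rate + k * math.log(rate) - math.lgamma(k + 1))

alpha = [0.1, 0.3, 0.5, 0.7, 0.2, 0.4, 0.6, 0.8]
probs = [transpoi_pmf(i, alpha) for i in range(len(alpha) + 1)]
```

Restricting i to [0:N] truncates a small amount of Poisson mass, so the values sum to slightly less than one while the mean stays close to μ.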
Indeed, results in Cekanavicius and Vaitkus [2001, Theorem 2.1 & Corollary 2.1] and Barbour and Čekanavičius [2002, Theorem 3.1] imply that the error, measured in total variation distance, decays at the rate N^{−1/2} as N → ∞.

Sampling from the posterior distribution p_θ(x|y) by naively enumerating over all 2^N configurations is computationally impractical. A key observation is that the observation density in (7) depends on the high-dimensional latent state X ∈ {0,1}^N only through the one-dimensional summary I(X) ∈ [0:N]. This prompts the inclusion of I = I(X) as an auxiliary variable. Thus the joint posterior distribution can be decomposed as

    p_θ(x, i|y) = p_θ(i|y) p_θ(x|i).    (13)

The dominant cost of sampling the posterior distribution of the summary

    p_θ(i|y) = p_θ(i) p_θ(y|i) / p_θ(y) = PoiBin(i; α) Bin(y; i, ρ) 1(i ≥ y) / p_θ(y),    i ∈ [0:N],    (14)

is the evaluation of the Poisson Binomial PMF (9). Recall from Section 2.1 that this can be done exactly in O(N^2), and in O(N) using a translated Poisson approximation. The conditional distribution of the latent state given the summary

    p_θ(x|i) = p_θ(x) p_θ(i|x) / p_θ(i) = Π_{n=1}^N (α^n)^{x^n} (1 − α^n)^{1−x^n} 1(I(x) = i) / PoiBin(i; α)    (15)

is known as a conditional Bernoulli distribution, which we will write as p_θ(x|i) = CondBer(x; α, i). Various sampling schemes have been proposed [Fan et al., 1962, Chen et al., 1994]; see Chen and Liu [1997, Section 4] for an overview. We will rely on the sequential decomposition

    p_θ(x|i) = p_θ(x^1|i) Π_{n=2}^N p_θ(x^n | x^{1:n−1}, i)    (16)
             = Π_{n=1}^{N−1} (α^n)^{x^n} (1 − α^n)^{1−x^n} q(i − i^{n−1} − x^n, n + 1) / q(i − i^{n−1}, n) · 1(x^N = i − i^{N−1}),

where i^0 = 0 and i^n = Σ_{m=1}^n x^m for n ∈ [1:N−1]. The values q(j, n) for j ∈ [0:i], n ∈ [1:N] needed in (16) can be precomputed using the same recursion as (10) at an O(iN) cost.
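To illustrate, the recursion (10) and the sequential decomposition (16) can be sketched together in Python, with the table q computed once and reused for sampling. The α values are arbitrary, and indexing is 0-based in the code versus 1-based in the text:

```python
import random

def poibin_table(alpha):
    # q[i][n] = PoiBin(i; alpha^{n+1:N}) in the paper's 1-based notation,
    # filled column by column with the recursion (10).
    N = len(alpha)
    q = [[0.0] * N for _ in range(N + 1)]
    prod = 1.0
    for n in range(N - 1, -1, -1):   # q(0, n) = prod_{m=n}^{N} (1 - alpha^m)
        prod *= 1.0 - alpha[n]
        q[0][n] = prod
    q[1][N - 1] = alpha[N - 1]       # q(1, N) = alpha^N; q(i, N) = 0 for i >= 2
    for n in range(N - 2, -1, -1):
        for i in range(1, N + 1):
            q[i][n] = alpha[n] * q[i - 1][n + 1] + (1.0 - alpha[n]) * q[i][n + 1]
    return q

def cond_ber_sample(alpha, i, q, rng):
    # Draw x ~ CondBer(alpha, i) via (16):
    # P(x^n = 1 | x^{1:n-1}, i) = alpha^n q(i - i^{n-1} - 1, n+1) / q(i - i^{n-1}, n).
    N = len(alpha)
    x = [0] * N
    remaining = i
    for n in range(N - 1):
        p1 = 0.0
        if remaining >= 1:
            p1 = alpha[n] * q[remaining - 1][n + 1] / q[remaining][n]
        x[n] = 1 if rng.random() < p1 else 0
        remaining -= x[n]
    x[N - 1] = remaining  # last component forced: 1(x^N = i - i^{N-1})
    return x

alpha = [0.2, 0.5, 0.7, 0.4, 0.6]
q = poibin_table(alpha)
pmf = [q[i][0] for i in range(len(alpha) + 1)]  # PoiBin(i; alpha) = q(i, 1)
rng = random.Random(7)
x = cond_ber_sample(alpha, 3, q, rng)
```

Every draw satisfies I(x) = i by construction, since the last component absorbs whatever count remains.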
This precomputation is not necessary if these values are stored when computing Poisson Binomial probabilities. To reduce the cost we can employ a Markov chain Monte Carlo (MCMC) method to sample from the conditional Bernoulli distribution CondBer(α, i). This incurs a cost of O(1) per iteration and converges in O(N log N) iterations, under some mild assumptions on α, as shown in Heng et al. [2020b]. On a related note, we can design MCMC algorithms to target p_θ(x|y). These might for example update the states of a few agents by sampling from their conditional distributions given all other variables, and alternately propose to swap zeros and ones in the vector X ∈ {0,1}^N. Each of these steps can be done with a cost independent of N, but the number of iterations for the algorithm to converge is expected to grow at least linearly with N.

We set up numerical experiments as follows. We generate covariates (w^n) from N(4, 1) for N = 1000 individuals independently, where N(μ, σ^2) denotes a Normal distribution with mean μ ∈ R and variance σ^2 > 0. Individual-specific infection probabilities are computed as α^n = (1 + exp(−β w^n))^{−1} using β = 0.3. We then simulate X^n ∼ Ber(α^n) independently for all n, and sample Y ∼ Bin(Σ_{n=1}^N X^n, ρ) with ρ = 0.8. We adopt a Bayesian approach and assign independent priors of N(0, 1) on β and Uniform(0, 1) on ρ. We focus on the task of sampling from the posterior distribution of θ = (β, ρ) given y and the covariates (w^n).

We consider random walk Metropolis–Hastings with exact likelihood calculation ("MH-exact"), associated with a quadratic cost in N. We also consider the same algorithm with a likelihood approximated by a translated Poisson ("MH-tp"), with a cost linear in N. Finally we consider a pseudo-marginal approach [Andrieu and Roberts, 2009, Andrieu et al., 2010] with likelihood estimated with P = 20 particles sampled from f_θ ("PMMH"). Note that P = 20 samples resulted in a low variance of the log-likelihood estimates. These algorithms employ Normal random walk proposals on β and log(ρ/(1 − ρ)) with standard deviation of 0.2. As a baseline we also consider a Gibbs sampler that alternates between updates of θ given X, employing the same proposal as above, and updates of X given θ. These employ an equally weighted mixture of kernels, performing either N random swap updates, or a systematic Gibbs scan of the N components of X; thus the cost per iteration is linear in N.

We first run "MH-exact" with 100,000 MCMC iterations (after a burn-in of 5,000 iterations) to obtain estimates of the posterior means of β and ρ. Using these posterior estimates as ground truth, we compute the mean squared error (MSE) of the posterior approximations obtained using each method with 20,000 MCMC iterations (excluding a burn-in of 5,000) and 50 independent repetitions. Table 1 displays the MSE, as well as the relative wall-clock time to obtain each estimate. Comparing the results of MH-exact and MH-tp shows that it is possible to save considerable effort at the expense of small differences in the parameter estimates using a translated Poisson approximation. The appeal of the PMMH approach compared to exact likelihood calculations is also clear. On the other hand, Gibbs samplers that alternate between the updates of θ and X do not seem to be competitive in this example.

    Method    | E[β|y]: Bias, Variance | E[ρ|y]: Bias, Variance | Relative cost
    MH-exact  | 25, 93.3               | 0.74, 6.39             | 128
    MH-tp     | 22, 52.3               | 0.32, 2.83             | 1
    PMMH      | 18, 79.2               | 0.50, 4.67             | 8
    Gibbs     | 1040, 113              | 91.1, 2.85             | 42

Table 1: Bias and variance when approximating the posterior means of β and ρ with each inference method (in units of 10^{−}). The rightmost column displays the average cost for each method, measured in terms of run-time relative to MH-tp.

We now extend the agent-based SIS model of Section 1.2. To allow for individual-specific attributes, for agent n ∈ [1:N], we model her initial infection probability α_0^n, infection rate λ^n and recovery rate γ^n as

    α_0^n = (1 + exp(−β_0^⊤ w^n))^{−1},    λ^n = (1 + exp(−β_λ^⊤ w^n))^{−1},    γ^n = (1 + exp(−β_γ^⊤ w^n))^{−1}.    (17)

Here β_0, β_λ, β_γ ∈ R^d are parameters and w^n ∈ R^d are the covariates of agent n. The interactions between agents are assumed to be known and specified by an undirected network; inference of the network structure and extensions to the time-varying case could be considered in future work. We will write D(n) and N(n) to denote the degree and neighbours of agent n ∈ [1:N].

For ease of presentation, we consider time steps of size h = 1. The time evolution of the population is given by

    X_0 ∼ μ_θ,    X_t | X_{t−1} = x_{t−1} ∼ f_θ(·|x_{t−1}),    t ∈ [1:T].
(18)

The initial distribution μ_θ(x_0) = Π_{n=1}^N Ber(x_0^n; α_0^n) corresponds to the static model of Section 2 with infection probabilities α_0 = (α_0^n)_{n ∈ [1:N]}. The Markov transition f_θ(x_t|x_{t−1}) = Π_{n=1}^N Ber(x_t^n; α^n(x_{t−1})) has conditional probabilities α(x_{t−1}) = (α^n(x_{t−1}))_{n ∈ [1:N]} given by

    α^n(x_{t−1}) = { λ^n D(n)^{−1} Σ_{m ∈ N(n)} x_{t−1}^m,    if x_{t−1}^n = 0,
                   { 1 − γ^n,                                 if x_{t−1}^n = 1,    (19)

for n ∈ [1:N]. We will assume that the cost of evaluating α is O(N). We refer readers to McVinish and Pollett [2012] for the asymptotic behaviour of this process as N → ∞, in the case of a fully connected network. In (19) we see the proportion of infected neighbours appearing in the infection probability at each time step. As an individual can have covariates w^n that include measures of contact frequency, which would affect α^n via λ^n, it is possible for an individual with many neighbours (large D(n)) to have a low infection probability if the frequency of contact is low.

Equations (17)-(19) and the observation model in (3) define a hidden Markov model on {0,1}^N, with unknown parameters θ = (β_0, β_λ, β_γ, ρ) ∈ Θ = R^{3d} × (0,1). Given observations y_{0:T} ∈ [0:N]^{T+1}, the complete data likelihood is given by

    p_θ(x_{0:T}, y_{0:T}) = p_θ(x_{0:T}) p_θ(y_{0:T}|x_{0:T}) = μ_θ(x_0) Π_{t=1}^T f_θ(x_t|x_{t−1}) Π_{t=0}^T g_θ(y_t|x_t).    (20)

Parameter inference will require marginalizing over the latent process to obtain the marginal likelihood p_θ(y_{0:T}), and estimation of agent states will involve sampling from the smoothing distribution p_θ(x_{0:T}|y_{0:T}). Exact computation of the marginal likelihood and of the marginals of the smoothing distribution using the forward algorithm and forward-backward algorithm, respectively, both require O(2^N T) operations.
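As a concrete reference, here is a Python sketch of the transition probabilities (19) on a user-supplied undirected network; the tiny path graph and the rate values below are hypothetical:

```python
import random

def transition_probs(x_prev, neighbours, lam, gamma):
    """alpha^n(x_{t-1}) as in (19): for a susceptible agent, lam_n times the
    proportion of infected neighbours; for an infected agent, 1 - gamma_n."""
    N = len(x_prev)
    alpha = [0.0] * N
    for n in range(N):
        if x_prev[n] == 1:
            alpha[n] = 1.0 - gamma[n]
        else:
            D = len(neighbours[n])
            infected = sum(x_prev[m] for m in neighbours[n])
            alpha[n] = lam[n] * infected / D if D > 0 else 0.0
    return alpha

def step(x_prev, neighbours, lam, gamma, rng):
    # One draw from f_theta(. | x_{t-1}): conditionally independent Bernoullis.
    a = transition_probs(x_prev, neighbours, lam, gamma)
    return [1 if rng.random() < an else 0 for an in a]

# Tiny hypothetical example: a path network 0 - 1 - 2, agent 0 infected.
neighbours = [[1], [0, 2], [1]]
lam = [0.6, 0.6, 0.6]
gamma = [0.3, 0.3, 0.3]
alpha = transition_probs([1, 0, 0], neighbours, lam, gamma)
# agent 0: 1 - gamma = 0.7; agent 1: one of two neighbours infected, 0.6 * 1/2;
# agent 2: no infected neighbours, probability 0.
rng = random.Random(0)
x1 = step([1, 0, 0], neighbours, lam, gamma, rng)
```

An isolated agent (D(n) = 0) is given infection probability zero here, an assumption not spelled out in (19).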
Since this exact computation is prohibitive for large N, we will rely on sequential Monte Carlo (SMC) approximations. In Section 3.1, we describe how SMC methods can be used to approximate the marginal likelihood and the smoothing distribution. Like many Monte Carlo schemes, the efficiency of SMC relies crucially on the choice of proposal distributions. The bootstrap particle filter (BPF) [Gordon et al., 1993], which corresponds to having the joint distribution of the latent process (18) as proposal, can often give poor performance in practice when the observations are informative. By building on the insights from Section 2, we will show how the fully adapted auxiliary particle filter (APF) [Pitt and Shephard, 1999, Carpenter et al., 1999] can be implemented. As the APF constructs a proposal transition at each time step that takes the next observation into account, it often performs better than the BPF, although not always [Johansen and Doucet, 2008]. In Section 3.2, we adapt the ideas in Guarniero et al. [2017], Heng et al. [2020a] to our setting and present a novel controlled SMC (cSMC) method that can significantly outperform the APF. Central to our approach is to take the entire observation sequence y_{0:T} into account by constructing proposal distributions that approximate the smoothing distribution p_θ(x_{0:T}|y_{0:T}). Using a simulated dataset, in Section 3.3 we show that cSMC provides orders of magnitude improvements over the BPF and APF in terms of estimating the marginal likelihood. In Section 3.4, we illustrate the behaviour of the marginal likelihood as the number of observations increases, and perform parameter inference and predictions. We consider MCMC strategies as alternatives to SMC-based methods in Appendix C.

SMC methods [Liu and Chen, 1998, Doucet et al., 2001], also known as particle filters, provide approximations of p_θ(y_{0:T}) and p_θ(x_{0:T}|y_{0:T}) by simulating an interacting particle system of size P ∈ N.
In the following, we give a generic description of SMC that includes several algorithms in a common framework.

At the initial time, one samples P configurations of the population from a proposal distribution q_0 on {0,1}^N, i.e. X_0^{(p)} = (X_0^{(p),n})_{n ∈ [1:N]} ∼ q_0(·|θ) independently for p ∈ [1:P]. Each possible configuration is then assigned a weight W_0^{(p)} ∝ w_0(X_0^{(p)}) that is normalized to sum to one. To focus our computation on the more likely configurations, we perform an operation known as resampling, which discards some configurations and duplicates others according to their weights. Each resampling scheme involves sampling ancestor indexes (A_0^{(p)})_{p ∈ [1:P]} ∈ [1:P]^P from a distribution r(·|W_0^{(1:P)}) on [1:P]^P. The simplest scheme is multinomial resampling [Gordon et al., 1993], which samples (A_0^{(p)})_{p ∈ [1:P]} independently from the categorical distribution on [1:P] with probabilities W_0^{(1:P)}; other lower-variance and adaptive resampling schemes can also be employed [Fearnhead and Clifford, 2003, Gerber et al., 2019]. Subsequently, for time step t ∈ [1:T], one propagates each resampled configuration according to a proposal transition q_t on {0,1}^N, i.e. X_t^{(p)} = (X_t^{(p),n})_{n ∈ [1:N]} ∼ q_t(·|X_{t−1}^{(A_{t−1}^{(p)})}, θ) independently for p ∈ [1:P]. As before, we weight each new configuration according to W_t^{(p)} ∝ w_t(X_{t−1}^{(A_{t−1}^{(p)})}, X_t^{(p)}), and for t < T, resample by drawing the ancestor indexes (A_t^{(p)})_{p ∈ [1:P]} ∼ r(·|W_t^{(1:P)}). For notational simplicity, we suppress the dependence of the weight functions (w_t)_{t ∈ [0:T]} on the parameter θ.

[Figure 1: Graphical model representation of the agent-based SIS model in Section 3 for T = 4 time steps and a fully connected network with N = 3 agents.]
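The generic scheme above can be made concrete as a bootstrap instance (q_0 = μ_θ, q_t = f_θ, w_t(x_t) = g_θ(y_t|x_t)) with multinomial resampling. The network, rates and observation sequence below are hypothetical, and the transition follows (19):

```python
import math
import random

def binom_pmf(y, n, p):
    # g_theta(y | x) = Bin(y; I(x), p); zero outside the support [0, n].
    if y < 0 or y > n:
        return 0.0
    return math.comb(n, y) * p**y * (1.0 - p) ** (n - y)

def sis_step(x, neighbours, lam, gamma, rng):
    # One draw from f_theta(. | x): independent Bernoullis with probabilities (19).
    new_x = []
    for n in range(len(x)):
        if x[n] == 1:
            a = 1.0 - gamma[n]
        else:
            D = len(neighbours[n])
            a = lam[n] * sum(x[m] for m in neighbours[n]) / D if D > 0 else 0.0
        new_x.append(1 if rng.random() < a else 0)
    return new_x

def multinomial_resample(weights, rng):
    # Ancestor indexes drawn iid from the categorical distribution on the weights.
    total = sum(weights)
    ancestors = []
    for _ in range(len(weights)):
        u = rng.random() * total
        acc = 0.0
        for idx, w in enumerate(weights):
            acc += w
            if u < acc:
                ancestors.append(idx)
                break
        else:
            ancestors.append(len(weights) - 1)
    return ancestors

def bootstrap_pf(y, P, alpha0, neighbours, lam, gamma, rho, rng):
    # Returns the log of the marginal likelihood estimator (22) for the BPF.
    N = len(alpha0)
    particles = [[1 if rng.random() < alpha0[n] else 0 for n in range(N)]
                 for _ in range(P)]
    log_Z = 0.0
    for t, yt in enumerate(y):
        if t > 0:
            particles = [sis_step(x, neighbours, lam, gamma, rng) for x in particles]
        weights = [binom_pmf(yt, sum(x), rho) for x in particles]
        mean_w = sum(weights) / P
        if mean_w == 0.0:
            return float("-inf")  # collapse: I(X_t^(p)) < y_t for all particles
        log_Z += math.log(mean_w)
        particles = [particles[a] for a in multinomial_resample(weights, rng)]
    return log_Z

# Hypothetical ring network with homogeneous rates.
N = 20
neighbours = [[(n - 1) % N, (n + 1) % N] for n in range(N)]
rng = random.Random(3)
log_Z = bootstrap_pf(y=[2, 2, 3, 3, 3], P=200, alpha0=[0.2] * N,
                     neighbours=neighbours, lam=[0.7] * N, gamma=[0.2] * N,
                     rho=0.8, rng=rng)
```

The early return illustrates the collapse to zero discussed below, which the alive particle filter or better-adapted proposals avoid.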
To approximate the desired quantities p_θ(y_{0:T}) and p_θ(x_{0:T}|y_{0:T}), these weight functions have to satisfy

    w_0(x_0) Π_{t=1}^T w_t(x_{t−1}, x_t) = μ_θ(x_0) Π_{t=1}^T f_θ(x_t|x_{t−1}) Π_{t=0}^T g_θ(y_t|x_t) / [q_0(x_0|θ) Π_{t=1}^T q_t(x_t|x_{t−1}, θ)].    (21)

Given the output of the above simulation, an unbiased estimator of the marginal likelihood p_θ(y_{0:T}) is

    p̂_θ(y_{0:T}) = [P^{−1} Σ_{p=1}^P w_0(X_0^{(p)})] Π_{t=1}^T [P^{−1} Σ_{p=1}^P w_t(X_{t−1}^{(A_{t−1}^{(p)})}, X_t^{(p)})],    (22)

and a particle approximation of the smoothing distribution is given by

    p̂_θ(x_{0:T}|y_{0:T}) = Σ_{p=1}^P W_T^{(p)} δ_{X_{0:T}^{(p)}}(x_{0:T}).    (23)

In the latter approximation, each trajectory X_{0:T}^{(p)} is formed by tracing the ancestral lineage of X_T^{(p)}, i.e. X_{0:T}^{(p)} = (X_t^{(B_t^{(p)})})_{t ∈ [0:T]} with B_T^{(p)} = p and B_t^{(p)} = A_t^{(B_{t+1}^{(p)})} for t ∈ [0:T−1]. Convergence properties of these approximations as P → ∞ are well-studied; see for example Del Moral [2004]. However, the quality of these approximations depends crucially on the choice of proposals (q_t)_{t ∈ [0:T]} and the corresponding weight functions (w_t)_{t ∈ [0:T]} that satisfy (21).

The BPF of Gordon et al. [1993] can be recovered by employing the proposals q_0(x_0|θ) = μ_θ(x_0), q_t(x_t|x_{t−1}, θ) = f_θ(x_t|x_{t−1}) for t ∈ [1:T] and the weight functions w_t(x_t) = g_θ(y_t|x_t) for t ∈ [0:T]. Although the BPF only costs O(NTP) to implement and has convergence guarantees as P → ∞, the variance of its marginal likelihood estimator can be too large to deploy within particle MCMC schemes for practical values of P (see Section 3.3). Another issue with its marginal likelihood estimator, for this particular choice of observation equation, is that it can collapse to zero if all proposed configurations have fewer infections than the observed value, i.e. if there exists t ∈ [0:T] such that I(X_t^{(p)}) < y_t for all p ∈ [1:P]. With increased cost, this issue can be circumvented using the alive particle filter of Del Moral et al. [2015], by repeatedly drawing samples at each time step until there are P configurations with numbers of infections larger than or equal to the observed value.

Alternatively, one can construct better proposals with supports that respect these observational constraints. One such option is the fully adapted APF [Pitt and Shephard, 1999, Carpenter et al., 1999], which corresponds to having the proposals q_0(x_0|θ) = p_θ(x_0|y_0), q_t(x_t|x_{t−1}, θ) = p_θ(x_t|x_{t−1}, y_t) for t ∈ [1:T] and the weight functions w_0(x_0) = p_θ(y_0) and w_t(x_{t−1}) = p_θ(y_t|x_{t−1}) for t ∈ [1:T]. At the initial time, computing the marginal likelihood p_θ(y_0) and sampling from the posterior of agent states p_θ(x_0|y_0) can be done exactly (or approximately) as described in Sections 2.1 and 2.2, respectively, for the static model. More precisely, we compute

    p_θ(y_0) = Σ_{i_0=0}^N PoiBin(i_0; α_0) Bin(y_0; i_0, ρ) 1(i_0 ≥ y_0)    (24)

and sample from

    p_θ(x_0, i_0|y_0) = p_θ(i_0|y_0) p_θ(x_0|i_0) = [PoiBin(i_0; α_0) Bin(y_0; i_0, ρ) 1(i_0 ≥ y_0) / p_θ(y_0)] CondBer(x_0; α_0, i_0),    (25)

which admits p_θ(x_0|y_0) as its marginal distribution. For time step t ∈ [1:T], by conditioning on the previous configuration x_{t−1} ∈ {0,1}^N, the same ideas can be used to compute the predictive likelihood p_θ(y_t|x_{t−1}) and sample from the transition p_θ(x_t|x_{t−1}, y_t), i.e. we compute

    p_θ(y_t|x_{t−1}) = Σ_{i_t=0}^N PoiBin(i_t; α(x_{t−1})) Bin(y_t; i_t, ρ) 1(i_t ≥ y_t)    (26)

and sample from

    p_θ(x_t, i_t|x_{t−1}, y_t) = p_θ(i_t|x_{t−1}, y_t) p_θ(x_t|x_{t−1}, i_t)    (27)
                              = [PoiBin(i_t; α(x_{t−1})) Bin(y_t; i_t, ρ) 1(i_t ≥ y_t) / p_θ(y_t|x_{t−1})] CondBer(x_t; α(x_{t−1}), i_t),

which admits p_θ(x_t|x_{t−1}, y_t) as its marginal transition.

An algorithmic description of the resulting APF is detailed in Algorithm 1, where the notation Cat([0:N], V^{(0:N)}) refers to the categorical distribution on [0:N] with probabilities V^{(0:N)} = (V^{(i)})_{i ∈ [0:N]}.
As the weights in the fully adapted APF at time t ∈ [1:T] only depend on the configuration at time t−1, we interchange the order of the sampling and resampling steps to promote sample diversity. The cost of running this APF exactly is O(N²TP). To reduce the computational cost to O(N log(N) TP), one can approximate the above Poisson binomial PMFs with the translated Poisson approximation (12), and employ MCMC to sample from the above conditioned Bernoulli distributions.
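As a hedged illustration of the translated Poisson idea referenced as (12) (our own function names; the exact form used in the paper may differ in details), a Poisson binomial with mean μ = Σ_n α^n and variance σ² = Σ_n α^n(1 − α^n) is approximated by the integer shift s = ⌊μ − σ²⌋ plus a Poisson variable with rate μ − s, so the mean is matched and the variance is σ² plus a fractional part:

```python
import math

def poibin_pmf(alpha):
    # Exact Poisson binomial PMF by convolution, kept here only for comparison.
    pmf = [1.0]
    for p in alpha:
        new = [0.0] * (len(pmf) + 1)
        for i, q in enumerate(pmf):
            new[i] += q * (1.0 - p)
            new[i + 1] += q * p
        pmf = new
    return pmf

def translated_poisson_pmf(k, alpha):
    # Translated Poisson: X ~ s + Poisson(mu - s) with s = floor(mu - sigma^2).
    mu = sum(alpha)
    sigma2 = sum(p * (1.0 - p) for p in alpha)
    s = math.floor(mu - sigma2)
    rate = mu - s
    j = k - s
    if j < 0:
        return 0.0
    return math.exp(-rate) * rate**j / math.factorial(j)

# Total variation distance between the exact and approximate PMFs on [0:N].
alpha = [0.5] * 30
exact = poibin_pmf(alpha)
tv = 0.5 * sum(abs(exact[k] - translated_poisson_pmf(k, alpha)) for k in range(31))
```

The O(N²) convolution is replaced by an O(N) computation of (μ, σ²) followed by O(1) PMF evaluations; already for N = 30 the total variation distance `tv` is small.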
Algorithm 1: Auxiliary particle filter for SIS model
Input:
Parameters θ ∈ Θ and number of particles P ∈ N
compute v_0^(i) = PoiBin(i; α_0) Bin(y_0; i, ρ) 1(i ≥ y_0) for i ∈ [0:N]
set w_0 = Σ_{i=0}^N v_0^(i) and normalize V_0^(i) = v_0^(i) / w_0 for i ∈ [0:N]
sample I_0^(p) ∼ Cat([0:N], V_0^(0:N)) and X_0^(p) | I_0^(p) ∼ CondBer(α_0, I_0^(p)) for p ∈ [1:P]
for t = 1, …, T and p = 1, …, P do
    compute v_t^(i,p) = PoiBin(i; α(X_{t−1}^(p))) Bin(y_t; i, ρ) 1(i ≥ y_t) for i ∈ [0:N]
    set w_t^(p) = Σ_{i=0}^N v_t^(i,p) and normalize V_t^(i,p) = v_t^(i,p) / w_t^(p) for i ∈ [0:N]
    normalize W_t^(p) = w_t^(p) / Σ_{k=1}^P w_t^(k)
    sample A_{t−1}^(p) ∼ r(· | W_t^(1:P))
    sample I_t^(p) ∼ Cat([0:N], V_t^(0:N, A_{t−1}^(p))) and X_t^(p) | I_t^(p) ∼ CondBer(α(X_{t−1}^(A_{t−1}^(p))), I_t^(p))
end
Output: Marginal likelihood estimator p̂_θ(y_{0:T}) = w_0 Π_{t=1}^T {P^{−1} Σ_{p=1}^P w_t^(p)}, states (X_t^(p))_{(t,p) ∈ [0:T]×[1:P]} and ancestors (A_t^(p))_{(t,p) ∈ [0:T−1]×[1:P]}

To obtain better performance than the APF, we can construct proposals that take not just the next but all future observations into account. We can sequentially decompose the smoothing distribution as

p_θ(x_{0:T} | y_{0:T}) = p_θ(x_0 | y_{0:T}) Π_{t=1}^T p_θ(x_t | x_{t−1}, y_{t:T}),   (28)

and it follows that the optimal proposal is q_0^⋆(x_0 | θ) = p_θ(x_0 | y_{0:T}) and q_t^⋆(x_t | x_{t−1}, θ) = p_θ(x_t | x_{t−1}, y_{t:T}) for t ∈ [1:T], as this gives exact samples from the smoothing distribution. The resulting SMC marginal likelihood estimator in (22) would have zero variance for any choice of weight functions satisfying (21). To design approximations of the optimal proposal, it is instructive to rewrite it as

p_θ(x_0 | y_{0:T}) = μ_θ(x_0) ψ_0^⋆(x_0) / μ_θ(ψ_0^⋆),  p_θ(x_t | x_{t−1}, y_{t:T}) = f_θ(x_t | x_{t−1}) ψ_t^⋆(x_t) / f_θ(ψ_t^⋆ | x_{t−1}),  t ∈ [1:T],   (29)

where ψ_t^⋆(x_t) = p(y_{t:T} | x_t) for t ∈ [0:T], and μ_θ(ψ_0^⋆) = Σ_{x_0 ∈ {0,1}^N} μ_θ(x_0) ψ_0^⋆(x_0) denotes the expectation of ψ_0^⋆
with respect to μ_θ, and f_θ(ψ_t^⋆ | x_{t−1}) = Σ_{x_t ∈ {0,1}^N} f_θ(x_t | x_{t−1}) ψ_t^⋆(x_t) denotes the conditional expectation of ψ_t^⋆ under f_θ. Equation (29) shows how the latent process (18), defined by μ_θ and f_θ, should be modified to obtain the optimal proposal. The functions (ψ_t^⋆)_{t ∈ [0:T]}, known as the backward information filter (BIF) [Bresler, 1986, Briers et al., 2010], can be defined using the backward recursion

ψ_T^⋆(x_T) = g_θ(y_T | x_T),  ψ_t^⋆(x_t) = g_θ(y_t | x_t) f_θ(ψ_{t+1}^⋆ | x_t),  t ∈ [0:T−1],   (30)

which shows how information from future observations is propagated backwards over time. As the costs of computing and storing the BIF using the recursion (30) are O(2^{2N} T) and O(2^N T) respectively, approximations are necessary when N is large. In contrast to Guarniero et al. [2017] and Heng et al. [2020a], which rely on regression techniques to approximate the BIF, our approach is based on dimensionality reduction by coarse-graining the agent-based model.

At the terminal time T, the function ψ_T^⋆(x_T) = Bin(y_T; I(x_T), ρ) 1(I(x_T) ≥ y_T) only depends on the agent states x_T ∈ {0,1}^N through the one-dimensional summary I(x_T) ∈ [0:N]. Therefore it suffices to compute and store ψ_T(i_T) = Bin(y_T; i_T, ρ) 1(i_T ≥ y_T) for all i_T ∈ [0:N]. Note that ψ_T(I(x_T)) should be seen as a function of the agent states. To iterate the backward recursion (30), we have to compute the conditional expectation f_θ(ψ_T | x_{T−1}) = Σ_{i_T=0}^N PoiBin(i_T; α(x_{T−1})) ψ_T(i_T) for all x_{T−1} ∈ {0,1}^N. By a thinning argument (Appendix A.2), this is equal to PoiBin(y_T; ρ α(x_{T−1})). Hence it is clear that all 2^N possible configurations of the population would have to be considered to iterate the recursion, and an approximation of α(x_{T−1}) is necessary at this point. To reduce dimension, we consider

ᾱ^n(x_{T−1}) = λ̄ N^{−1} I(x_{T−1}) if x_{T−1}^n = 0,  1 − γ̄ if x_{T−1}^n = 1.
(31)

This amounts to approximating the proportion of infected neighbours of each agent by the population proportion of infections, and replacing individual infection and recovery rates in (19) by their population averages, i.e. λ^n ≈ λ̄ = N^{−1} Σ_{n=1}^N λ^n and γ^n ≈ γ̄ = N^{−1} Σ_{n=1}^N γ^n. Writing f̄_θ(x_T | x_{T−1}) = Π_{n=1}^N Ber(x_T^n; ᾱ^n(x_{T−1})) for the Markov transition associated with (31), we approximate the conditional expectation f_θ(ψ_T | x_{T−1}) by

f̄_θ(ψ_T | I(x_{T−1})) = Σ_{i_T=0}^N SumBin(i_T; N − I(x_{T−1}), λ̄ N^{−1} I(x_{T−1}), I(x_{T−1}), 1 − γ̄) ψ_T(i_T),   (32)

where SumBin(N_1, p_1, N_2, p_2) denotes the distribution of the sum of two independent Bin(N_1, p_1) and Bin(N_2, p_2) random variables, i.e. SumBin(i; N_1, p_1, N_2, p_2) = Σ_{j=0}^i Bin(j; N_1, p_1) Bin(i − j; N_2, p_2) for i ∈ [0 : N_1 + N_2]. This follows as a Poisson binomial distribution with the homogeneous probabilities (31) reduces to the SumBin distribution in (32), which is not analytically tractable but can be computed exactly at O(N²) cost using a naive implementation of the convolution. In the large N regime, we advocate approximating the SumBin distribution with the translated Poisson (12), which reduces the cost to O(N) at the price of an approximation error. Moreover, the latter bias only affects the quality of our proposal distributions, and not the consistency properties of SMC approximations. The resulting approximation of ψ_{T−1}^⋆(x_{T−1}) can be computed using

ψ_{T−1}(i_{T−1}) = Bin(y_{T−1}; i_{T−1}, ρ) 1(i_{T−1} ≥ y_{T−1}) f̄_θ(ψ_T | i_{T−1})   (33)

for i_{T−1} ∈ [0:N]. This costs O(N³) if convolutions are implemented naively and O(N²) if translated Poisson approximations of (32) are employed. We then continue in the same manner to approximate the recursion (30) until the initial time.
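The backward pass just described can be sketched as follows: a simplified, self-contained rendering (our own function names) of the coarse-grained recursion, using the naive SumBin convolution with homogeneous rates λ̄ and γ̄.

```python
import math

def binom_pmf(k, n, p):
    # Bin(k; n, p), zero outside the support.
    if k < 0 or k > n:
        return 0.0
    return math.comb(n, k) * p**k * (1.0 - p)**(n - k)

def sumbin_pmf(i, n1, p1, n2, p2):
    # SumBin(i; n1, p1, n2, p2): convolution of two binomial PMFs, as in (32).
    return sum(binom_pmf(j, n1, p1) * binom_pmf(i - j, n2, p2) for j in range(i + 1))

def approximate_bif(y, N, rho, lam_bar, gam_bar):
    # y[t] = reported infection count at time t; returns psi[t][i], i in [0:N].
    T = len(y) - 1
    psi = [[0.0] * (N + 1) for _ in range(T + 1)]
    for i in range(N + 1):
        psi[T][i] = binom_pmf(y[T], i, rho) if i >= y[T] else 0.0
    for t in range(T - 1, -1, -1):
        for i in range(N + 1):
            # Coarse-grained one-step expectation: each susceptible is infected
            # with prob lam_bar * i / N, each infected stays with prob 1 - gam_bar.
            fbar = sum(
                sumbin_pmf(j, N - i, lam_bar * i / N, i, 1.0 - gam_bar) * psi[t + 1][j]
                for j in range(N + 1)
            )
            psi[t][i] = (binom_pmf(y[t], i, rho) if i >= y[t] else 0.0) * fbar
    return psi
```

As written, the nested convolutions make the pass cubic in N per time step; substituting a translated Poisson PMF for `sumbin_pmf` removes one factor of N.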
Algorithm 2 summarizes our approximation of the BIF (ψ_t)_{t ∈ [0:T]}, which costs O(N³T) or O(N²T) to compute and O(NT) in storage. Our corresponding approximation of the optimal proposal (29) is

q_0(x_0 | θ) = μ_θ(x_0) ψ_0(I(x_0)) / μ_θ(ψ_0),  q_t(x_t | x_{t−1}, θ) = f_θ(x_t | x_{t−1}) ψ_t(I(x_t)) / f_θ(ψ_t | x_{t−1}),  t ∈ [1:T].   (34)

To employ these proposals within SMC, the appropriate weight functions [Scharth and Kohn, 2016] satisfying (21) are

w_0(x_0) = μ_θ(ψ_0) g_θ(y_0 | x_0) f_θ(ψ_1 | x_0) / ψ_0(I(x_0)),  w_t(x_t) = g_θ(y_t | x_t) f_θ(ψ_{t+1} | x_t) / ψ_t(I(x_t)),  t ∈ [1:T−1],   (35)

and w_T(x_T) = 1. To evaluate the weights, note that the expectations can be computed as

μ_θ(ψ_0) = Σ_{i_0=0}^N PoiBin(i_0; α_0) ψ_0(i_0),  f_θ(ψ_t | x_{t−1}) = Σ_{i_t=0}^N PoiBin(i_t; α(x_{t−1})) ψ_t(i_t),  t ∈ [1:T].   (36)

Sampling from the proposals in (34) can be performed in a similar manner as in the APF. At the initial time, we sample from

q_0(x_0, i_0 | θ) = q_0(i_0 | θ) q_0(x_0 | i_0, θ) = [PoiBin(i_0; α_0) ψ_0(i_0) / μ_θ(ψ_0)] CondBer(x_0; α_0, i_0),   (37)

which admits q_0(x_0 | θ) as its marginal distribution, and for time t ∈ [1:T] from

q_t(x_t, i_t | x_{t−1}, θ) = q_t(i_t | x_{t−1}, θ) q_t(x_t | x_{t−1}, i_t, θ) = [PoiBin(i_t; α(x_{t−1})) ψ_t(i_t) / f_θ(ψ_t | x_{t−1})] CondBer(x_t; α(x_{t−1}), i_t),   (38)

which admits q_t(x_t | x_{t−1}, θ) as its marginal transition. Algorithm 3 gives an algorithmic summary of the resulting SMC method, which we shall refer to as controlled SMC (cSMC), following the terminology of Heng et al. [2020a].
The resulting cSMC method has the same cost as the APF, and one can again employ translated Poisson approximations (12) and MCMC to reduce the computational cost.

To study the performance of cSMC, we consider the Kullback–Leibler (KL) divergence from our proposal distribution q_θ(x_{0:T}) = q_0(x_0 | θ) Π_{t=1}^T q_t(x_t | x_{t−1}, θ) to the smoothing distribution p_θ(x_{0:T} | y_{0:T}), denoted by KL(p_θ(x_{0:T} | y_{0:T}) | q_θ(x_{0:T})), which characterizes the quality of our importance proposal [Chatterjee and Diaconis, 2018]. The following result provides a decomposition of this KL divergence in terms of logarithmic differences between the BIF (ψ_t^⋆)_{t ∈ [0:T]} and our approximation (ψ_t)_{t ∈ [0:T]} under the marginal distributions of the smoothing distribution and of our proposal distribution, denoted by η_t^⋆(x_t | θ) and η_t(x_t | θ) respectively for each time t ∈ [0:T]. Given a function φ : {0,1}^N → R, we write η_t^⋆(φ | θ) and η_t(φ | θ) for expectations under these marginal distributions, and the corresponding L²-norms as ‖φ‖_{L²(η_t^⋆)} = η_t^⋆(φ² | θ)^{1/2} and ‖φ‖_{L²(η_t)} = η_t(φ² | θ)^{1/2}.

Proposition 3.1. The Kullback–Leibler divergence from q_θ(x_{0:T}) to p_θ(x_{0:T} | y_{0:T}) satisfies

KL(p_θ(x_{0:T} | y_{0:T}) | q_θ(x_{0:T})) ≤ Σ_{t=0}^T { η_t^⋆(log(ψ_t^⋆/ψ_t) | θ) + M_{t−1} η_t(log(ψ_t/ψ_t^⋆) | θ) },   (39)

where M_{−1} = 1 and M_t = max_{x_t ∈ {0,1}^N} η_t^⋆(x_t | θ)/η_t(x_t | θ) for t ∈ [0:T−1]. The constants (M_t)_{t ∈ [0:T−1]} are finite, as can be shown by upper bounding the weight functions in (35). The next result characterizes the error of our BIF approximation measured in terms of the KL upper bound in (39).

Proposition 3.2.
For each time t ∈ [0:T], the BIF approximation in Algorithm 2 satisfies

η_t^⋆(log(ψ_t^⋆/ψ_t) | θ) ≤ Σ_{k=t}^{T−1} c_θ^⋆(ψ_{k+1}) ‖φ_θ^⋆‖_{L²(η_k^⋆)} ‖Δ_θ‖_{L²(η_k^⋆)},   (40)

η_t(log(ψ_t/ψ_t^⋆) | θ) ≤ Σ_{k=t}^{T−1} c_θ(ψ_{k+1}) ‖φ_θ‖_{L²(η_k)} ‖Δ_θ‖_{L²(η_k)},   (41)

with constants

c_θ^⋆(ψ_k) = 1 / min_{x_{k−1} ∈ {0,1}^N \ {(0,…,0)}} f_θ(ψ_k | x_{k−1}) < ∞,  c_θ(ψ_k) = 1 / min_{i_{k−1} ∈ [1:N]} f̄_θ(ψ_k | i_{k−1}) < ∞

for k ∈ [1:T], the function Δ_θ(x) = Σ_{n=1}^N |ᾱ^n(x) − α^n(x)|, the functions

φ_θ^⋆(x) = ( Σ_{i ∈ [0:N]} PoiBin(i; α(x))² / PoiBin(i; ᾱ(x)) )^{1/2},  φ_θ(x) = ( Σ_{i ∈ [0:N]} PoiBin(i; ᾱ(x))² / PoiBin(i; α(x)) )^{1/2}   (42)

for x ∈ {0,1}^N, and the convention that Σ_{k=t}^p {·} = 0 for p < t.

The proof of Proposition 3.2 in Appendix B.3 derives recursive bounds on the approximation errors. The crux of our argument is to upper bound the error incurred by taking conditional expectations under the homogeneous probabilities (31). This relies on an upper bound on the Kullback–Leibler divergence between two Poisson binomial distributions that is established in Appendix A.3, which may be of independent interest. If conditional expectations are further approximated using translated Poisson approximations, one can also employ the results of Čekanavičius and Vaitkus [2001] and Barbour and Čekanavičius [2002] to incorporate these errors in our analysis. The error bounds in (40) and (41) show how the accuracy of the BIF approximation depends on our approximation of the success probabilities α(x) via the term Δ_θ(x). If we decompose Δ_θ(x) ≤ Δ_θ^G(x) + Δ_θ^λ + Δ_θ^γ, where

Δ_θ^G(x) = Σ_{n=1}^N | N^{−1} I(x) − D(n)^{−1} Σ_{m ∈ N(n)} x^m |,  Δ_θ^λ = Σ_{n=1}^N |λ̄ − λ^n|,  Δ_θ^γ = Σ_{n=1}^N |γ̄ − γ^n|,   (43)

we see the effect of coarse-graining the agent-based model in (31). In Appendix B.4, we discuss how to reduce the errors Δ_θ^λ and Δ_θ^γ.
By adopting a more fine-grained approximation based on clustering of the infection and recovery rates, one can obtain more accurate approximations at the expense of increased computational cost.

Algorithm 2: Backward information filter approximation for SIS model
Input:
Parameters θ ∈ Θ
compute the average infection and recovery rates λ̄ = N^{−1} Σ_{n=1}^N λ^n and γ̄ = N^{−1} Σ_{n=1}^N γ^n
compute ψ_T(i_T) = Bin(y_T; i_T, ρ) 1(i_T ≥ y_T) for i_T ∈ [0:N]
for t = T−1, …, 0 and i_t = 0, …, N do
    compute f̄_θ(ψ_{t+1} | i_t) = Σ_{i_{t+1}=0}^N SumBin(i_{t+1}; N − i_t, λ̄ N^{−1} i_t, i_t, 1 − γ̄) ψ_{t+1}(i_{t+1})
    compute ψ_t(i_t) = Bin(y_t; i_t, ρ) 1(i_t ≥ y_t) f̄_θ(ψ_{t+1} | i_t)
end
Output: Approximate BIF (ψ_t)_{t ∈ [0:T]}

Algorithm 3:
Controlled sequential Monte Carlo for SIS model
Input:
Parameters θ ∈ Θ, approximate BIF (ψ_t)_{t ∈ [0:T]} and number of particles P ∈ N
compute probabilities v_0^(i) = PoiBin(i; α_0) ψ_0(i) for i ∈ [0:N]
set E_0 = Σ_{i=0}^N v_0^(i) and normalize probabilities V_0^(i) = v_0^(i) / E_0 for i ∈ [0:N]
sample I_0^(p) ∼ Cat([0:N], V_0^(0:N)) and X_0^(p) | I_0^(p) ∼ CondBer(α_0, I_0^(p)) for p ∈ [1:P]
compute weights w_0^(p) = w_0(X_0^(p)) using (35) for p ∈ [1:P]
for t = 1, …, T and p = 1, …, P do
    normalize weights W_{t−1}^(p) = w_{t−1}^(p) / Σ_{k=1}^P w_{t−1}^(k)
    sample ancestors A_{t−1}^(p) ∼ r(· | W_{t−1}^(1:P))
    compute probabilities v_t^(i,p) = PoiBin(i; α(X_{t−1}^(A_{t−1}^(p)))) ψ_t(i) for i ∈ [0:N]
    set E_t^(p) = Σ_{i=0}^N v_t^(i,p) and normalize probabilities V_t^(i,p) = v_t^(i,p) / E_t^(p) for i ∈ [0:N]
    sample I_t^(p) ∼ Cat([0:N], V_t^(0:N,p)) and X_t^(p) | I_t^(p) ∼ CondBer(α(X_{t−1}^(A_{t−1}^(p))), I_t^(p))
    compute weights w_t^(p) = w_t(X_t^(p)) using (35)
end
Output: Marginal likelihood estimator p̂_θ(y_{0:T}) = Π_{t=0}^T {P^{−1} Σ_{p=1}^P w_t^(p)}, states (X_t^(p))_{(t,p) ∈ [0:T]×[1:P]} and ancestors (A_t^(p))_{(t,p) ∈ [0:T−1]×[1:P]}

3.3 Numerical illustration of sequential Monte Carlo methods

We now illustrate the behaviour of the above SMC methods on simulated data. We consider a population of N = 100 agents that are fully connected, over T = 90 time steps. The agent covariates w^n = (w_1^n, w_2^n) are taken as w_1^n = 1 and sampled from w_2^n ∼ N(0, 1) independently for n ∈ [1:N]. Given these covariates, we simulate data from model (17)-(19) and (3) with the parameters β_0 = (−log(N−1), ·), β_λ = (−·, 2), β_γ = (−·, −1) and ρ = 0.·.

The effective sample size (ESS) at each time t ∈ [0:T], defined in terms of the normalized weights (W_t^(p))_{p ∈ [1:P]} as 1 / Σ_{p=1}^P (W_t^(p))², measures the adequacy of the importance sampling approximation at each step. We consider two implementations of the controlled SMC in Algorithm 3: cSMC1 employs proposal distributions defined by the BIF approximation in Algorithm 2, while cSMC2 relies on translated Poisson approximations of the SumBin distributions in Algorithm 2 to lower the computational cost. This lowers the run-time of the BIF approximation from 0.35 to 0.12 seconds, which is insignificant relative to the cost of cSMC for a population size of N = 100. As expected, the APF performs better than the BPF by taking the next observation into account, and cSMC does better than the APF by incorporating information from the entire observation sequence. Furthermore, the faster BIF approximation does not result in a noticeable loss of cSMC performance, owing to the accuracy of the translated Poisson approximations with N = 100 agents. Although the performance of the BPF seems adequate in this simulated data setting, the middle and bottom panels of Figure 2 reveal that its particle approximation can collapse whenever there are smaller or larger observation counts. In contrast, the performance of APF and cSMC appears more robust to such informative observations.

Figure 2: Effective sample size of various SMC methods with P = 512 particles when considering the simulated observations (top), when observations at three time points are replaced by ⌊y_t/2⌋ (middle), or by min(2y_t, N) (bottom).

Next we examine the performance of these SMC methods in terms of marginal likelihood estimation.

          BPF               APF                          cSMC1                 cSMC2
P         Var     Cost      Var     Var       Cost       Var      Var          Var      Var          Cost
          (DGP)   (sec)     (DGP)   (non-DGP) (sec)      (DGP)    (non-DGP)    (DGP)    (non-DGP)    (sec)
64        4.32    0.09      0.281   71.88     1.49       0.0696   13.52        0.0779   18.44        1.46
128       2.39    0.17      0.154   39.52     2.95       0.0285   8.62         0.0382   8.48         2.88
256       1.67    0.33      0.110   26.86     5.85       0.0164   4.11         0.0190   6.88         5.72
512       0.88    0.63      0.056   18.98     11.72      0.0087   3.57         0.0105   5.03         11.41
1024      0.55    1.25      0.026   13.38     23.46      0.0049   2.05         0.0046   3.41         22.83
2048      0.31    2.49      0.011   9.93      47.48      0.0020   1.15         0.0027   2.07         45.97
Table 2: Variance of the log-marginal likelihood estimator and its average cost (in seconds) using different numbers of particles P and various SMC methods. The parameter sets considered are the data generating parameters (DGP) and a modification of the DGP with β_λ = (−·, 0) (labelled as non-DGP). The costs of cSMC1 and cSMC2 are comparable, and the run-times to compute the BIF approximations (0.35 and 0.12 seconds for cSMC1 and cSMC2) are not reported in this table.

Table 2 displays the variance of the log-marginal likelihood estimator at two parameter sets, and its average cost measured as run-time, estimated using 100 independent repetitions of each method. At the DGP, it is apparent that the cSMC estimators reach the asymptotic regime of P → ∞ earlier than the BPF and APF, which seem to require at least P = 512 particles. Based on the largest value of P that we considered, the asymptotic variances of APF, cSMC1 and cSMC2 were found to be 29, 155 and 115 times smaller than that of the BPF, respectively. As the cost of APF and cSMC was approximately 19 times that of the BPF in our implementation, APF, cSMC1 and cSMC2 are respectively 1.5, 8 and 6 times more efficient than the BPF at the DGP. We can expect these efficiency gains to be more significant as we move away from the DGP. To illustrate this, we consider another parameter set which has β_λ = (−·, 0) and keeps all other parameters at the DGP. Although this is a less likely set of parameters, as the log-marginal likelihood is approximately 232 lower than at the DGP, adequate marginal likelihood estimation is crucial when employing SMC methods within particle MCMC algorithms for parameter inference. In this case, we found that the BPF marginal likelihood estimates could collapse to zero for the values of P considered in Table 2. In contrast, APF and cSMC do not suffer from this issue by construction. By increasing the number of BPF particles to P = 262,144 and comparing its performance to APF, cSMC1 and cSMC2 with P = 2048 particles, we find that the BPF is respectively 9, 76 and 42 times less efficient at this parameter set. Lastly, Figure 3 illustrates the comparison of SMC methods as the parameter ρ varies with all other parameters fixed at the DGP.

Figure 3: Variance of the log-marginal likelihood estimator (left) and relative efficiency (right) of various SMC methods compared to the BPF with P = 512 particles, as ρ varies and the other parameters are fixed at the DGP.

We first concern ourselves with the behaviour of the marginal likelihood and the maximum likelihood estimator (MLE) arg max_{θ ∈ Θ} p_θ(y_{0:T}) as the number of observations T → ∞. This is illustrated with our running simulated dataset from Section 3.3. Figure 4 plots the log-likelihood as a function of β_λ = (β_λ,1, β_λ,2) or of (β_λ,2, β_γ,2) with the other parameters fixed at their data generating values, estimated using cSMC with P = 64 particles. These plots reveal the complex behavior of the likelihood functions induced by agent-based SIS models. Furthermore, we see that the likelihood concentrates more around the DGP as T increases, and that the MLE can recover the DGP when T is sufficiently large.

By building on the SMC methods in Sections 3.1 and 3.2, one can construct a stochastic gradient ascent scheme or an expectation-maximization algorithm to approximate the MLE. We refer readers to Kantas et al. [2015] for a comprehensive review of such approaches. In the Bayesian framework, our interest is in the posterior distribution

p(θ, x_{0:T} | y_{0:T}) = p(θ | y_{0:T}) p_θ(x_{0:T} | y_{0:T}) ∝ p(θ) p_θ(x_{0:T}, y_{0:T}),   (44)

where p(dθ) = p(θ) dθ is a given prior distribution on the parameter space Θ.
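Posterior sampling for θ can then proceed by embedding a marginal likelihood estimator in a Metropolis–Hastings loop. A generic, hedged sketch (our own names; a toy exact Gaussian log-likelihood stands in for the SMC estimator so the example is self-contained):

```python
import math, random

def pmmh(log_prior, log_lik_hat, theta0, prop_sd, n_iter, rng):
    # Metropolis-Hastings on theta, with the (possibly intractable) likelihood
    # replaced by an estimate; with an unbiased estimator this is PMMH and
    # still targets the exact posterior.
    theta = theta0
    log_post = log_prior(theta) + log_lik_hat(theta)
    chain = []
    for _ in range(n_iter):
        prop = theta + prop_sd * rng.gauss(0.0, 1.0)
        log_post_prop = log_prior(prop) + log_lik_hat(prop)
        if math.log(rng.random()) < log_post_prop - log_post:
            theta, log_post = prop, log_post_prop
        chain.append(theta)
    return chain

# Toy check: Gaussian data with unknown mean, exact likelihood as "estimator".
rng = random.Random(0)
data = [rng.gauss(1.5, 1.0) for _ in range(50)]
log_prior = lambda th: -0.5 * th * th / 100.0
log_lik = lambda th: sum(-0.5 * (y - th) ** 2 for y in data)
chain = pmmh(log_prior, log_lik, 0.0, 0.5, 2000, rng)
```

In the agent-based setting, `log_lik_hat` would return the logarithm of the estimator p̂_θ(y_{0:T}) produced by an SMC run at the proposed parameter.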
We employ particlemarginal Metropolis–Hastings (PMMH) to sample from the posterior distribution. Followingthe discussion in Section 3.3, we will choose cSMC to construct a more efficient PMMH chain.We now illustrate our inference method on the simulated data setup of Section 3.3. We adopta prior distribution that assumes the parameters are independent with β , β λ , β γ ∼ N (0 ,
9) and ρ ∼ Uniform(0 , β , β λ , β γ , log( ρ/ (1 − ρ )))-space, with a standard deviation of 0 . P = 128 particles in the cSMC algorithm within PMMHand we run the PMMH algorithm for 100 ,
000 iterations after a burn in of 5000 iterations.Using these posterior samples, we infer the ratios R n = λ n /γ n , which can be understood as thereproductive number of agent n ∈ [1 : N ]. In the left panel of Figure 5, we display the estimatedposterior medians and 95% credible sets, as well as the data generating values. Although theposterior median estimates are similar across agents, there is large posterior uncertainty foragents with small or large data generated ratios. To visualize how ( R n ) n ∈ [1: N ] is distributed inthe population, we estimate histograms that take parameter uncertainty into account, illustratethe results in the right panel of Figure 5. The posterior median estimates yields a histogramthat is more concentrated than its data generating counterpart.Lastly, we examine the predictive performance of the model when relatively few observationsare available. As illustrated in Figure 6, we assume access to the first t = 30 observations (black19 lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll l 
llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll l lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll T = 10 T = 30 T = 90 −4 0 4 −4 0 4 −4 0 4−404 b l bl −100 −20 −10−5 0log−likelihood 
llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll l lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll 
lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll T = 10 T = 30 T = 90 −4 0 4 −4 0 4 −4 0 4−404 b l bg −100 −20 −10−5 0log−likelihood Figure 4: Estimated log-likelihood as a function of β λ = ( β λ , β λ ) ( first row ) or ( β λ , β γ ) ( secondrow ) with the other parameters fixed at the DGP given t ∈ { , , } observations. For ease ofplotting, the log-likelihood values were translated so that the MLE ( red dot ) attains a maximumof zero. The data generating parameters of β λ = ( − ,
2) or ( β λ , β γ ) = (2 , −
1) are shown as ablack dot. 20 lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll −10010 0 25 50 75 100sorted agent index l og ( R ) ( R ) p r opo r t i on Figure 5: Posterior estimates of the reproductive number R n = λ n /γ n of each agent n ∈ [1 : N ]( left ) and its distribution in the population ( right ). Posterior medians are shown in blue andthe 95% posterior credible sets in light blue. Data generating values are illustrated in black.dots) and predict the rest of the time series up to time T = 90 (grey dots) using the posteriorpredictive distribution p ( y t +1: T | y t ) = R Θ P x t ∈{ , } N p θ ( y t +1: T | x t ) p ( θ, x t | y t ). By simulatingtrajectories from the posterior predictive (grey lines), we obtain the model predictions anduncertainty estimates in Figure 6. llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll ob s e r v a t i on Figure 6: Observation sequence plotted as dots that are coloured initially in black followedby grey from time t = 30 to T = 90. Some sampled trajectories from the posterior predictivedistribution p ( y t +1: T | y t ) are depicted as grey lines. The blue dashed line shows the medians overtime; the lower and upper light blue dashed lines correspond to the 2 .
5% and 97 .
5% quantilesrespectively. 21
4 Susceptible-Infected-Recovered model
We consider a susceptible-infected-recovered (SIR) model, where agents become immune to a pathogen after they recover from an infection. The "recovered" status of an agent shall be encoded by a state of 2. Given a population configuration x = (x^n)_{n ∈ [1:N]} ∈ {0,1,2}^N, we will write S^n(x) = 1(x^n = 0), I^n(x) = 1(x^n = 1), R^n(x) = 1(x^n = 2) to indicate the status of agent n ∈ [1:N], and S(x) = Σ_{n=1}^N S^n(x), I(x) = Σ_{n=1}^N I^n(x), R(x) = Σ_{n=1}^N R^n(x) to count the number of agents in each state. Under the assumption of a closed population, we have S(x) + I(x) + R(x) = N.

The time evolution of the population (X_t)_{t ∈ [0:T]} is now modelled by a Markov chain on {0,1,2}^N, i.e. the specification in (18) with

μ_θ(x_0) = Π_{n=1}^N Cat(x_0^n; [0:2], α_0^n),  f_θ(x_t | x_{t−1}) = Π_{n=1}^N Cat(x_t^n; [0:2], α^n(x_{t−1})),  t ∈ [1:T].   (45)

The above probabilities are given by α_0^n = (α_{0,S}^n, α_{0,I}^n, α_{0,R}^n) = (1 − α_{0,I}^n, α_{0,I}^n, 0) and α^n(x_{t−1}) = (α_S^n(x_{t−1}), α_I^n(x_{t−1}), α_R^n(x_{t−1})), where

α_S^n(x_{t−1}) = S^n(x_{t−1}) (1 − λ^n D(n)^{−1} Σ_{m ∈ N(n)} I^m(x_{t−1})),
α_I^n(x_{t−1}) = S^n(x_{t−1}) λ^n D(n)^{−1} Σ_{m ∈ N(n)} I^m(x_{t−1}) + I^n(x_{t−1})(1 − γ^n),   (46)
α_R^n(x_{t−1}) = I^n(x_{t−1}) γ^n + R^n(x_{t−1}),

which satisfies α_S^n(x_{t−1}) + α_I^n(x_{t−1}) + α_R^n(x_{t−1}) = 1. Note that α_R^n(x_{t−1}) = 1 if R^n(x_{t−1}) = 1, which reflects the above-mentioned immunity. Just like the SIS model, the agents' initial infection probabilities, infection and recovery rates (α_{0,I}^n, λ^n, γ^n)_{n ∈ [1:N]} are specified with (17), and the observation model for the number of infections reported over time (Y_t)_{t ∈ [0:T]} is (3).

We consider again SMC approximations of the marginal likelihood and smoothing distribution of the resulting hidden Markov model on {0,1,2}^N. The BPF can be readily implemented at O(NTP) cost, but suffers from the difficulties discussed in Section 3.1. To obtain better performance, we discuss how to implement the fully adapted APF in Section 4.1 and adapt our cSMC construction in Section 4.2. Alternative MCMC approaches, such as those detailed in Appendix C, could also be considered.
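One step of the SIR dynamics (45)-(46) can be sketched as follows (our own naming; for brevity we take a fully connected graph, so D(n) = N − 1 and N(n) is everyone else):

```python
import random

def sir_step(x, lam, gam, rng):
    # One transition of the agent-based SIR chain on {0,1,2}^N, following (46),
    # on a fully connected graph: each agent's degree is D(n) = N - 1.
    N = len(x)
    total_inf = sum(1 for s in x if s == 1)
    new_x = []
    for n in range(N):
        if x[n] == 0:
            # Susceptible: infected with prob lam[n] * (infected neighbours / degree);
            # agent n is not infected, so all total_inf infecteds are neighbours.
            frac = total_inf / (N - 1)
            new_x.append(1 if rng.random() < lam[n] * frac else 0)
        elif x[n] == 1:
            # Infected: recovers with prob gam[n], otherwise stays infected.
            new_x.append(2 if rng.random() < gam[n] else 1)
        else:
            # Recovered: immune, an absorbing state.
            new_x.append(2)
    return new_x
```

With λ^n = 0 and γ^n = 1 the step is deterministic: susceptibles stay put and every infected agent recovers, which gives a quick sanity check.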
We recall that implementing the fully adapted APF requires one to sample from the proposals q_0(x_0 | θ) = p_θ(x_0 | y_0) and q_t(x_t | x_{t−1}, θ) = p_θ(x_t | x_{t−1}, y_t) for t ∈ [1:T], and to evaluate the weight functions w_0(x_0) = p_θ(y_0) and w_t(x_{t−1}) = p_θ(y_t | x_{t−1}) for t ∈ [1:T]. At the initial time, we compute the marginal likelihood p_θ(y_0) as

p_θ(y_0) = Σ_{i_0=0}^N PoiBin(i_0; α_{0,I}) Bin(y_0; i_0, ρ) 1(i_0 ≥ y_0),   (47)

where α_{0,I} = (α_{0,I}^n)_{n ∈ [1:N]}, and sample from

p_θ(x_0, i_0, i_0^{1:N} | y_0) = p_θ(i_0 | y_0) p_θ(i_0^{1:N} | i_0) p_θ(x_0 | i_0^{1:N}) = [PoiBin(i_0; α_{0,I}) Bin(y_0; i_0, ρ) 1(i_0 ≥ y_0) / p_θ(y_0)] CondBer(i_0^{1:N}; α_{0,I}, i_0) Π_{n=1}^N Cat(x_0^n; [0:2], ν(i_0^n)),   (48)

with ν(i_0^n) = (1 − i_0^n, i_0^n, 0), which admits p_θ(x_0 | y_0) as its marginal distribution. Similarly, for time step t ∈ [1:T], we compute the predictive likelihood p_θ(y_t | x_{t−1}) as

p_θ(y_t | x_{t−1}) = Σ_{i_t=0}^N PoiBin(i_t; α_I(x_{t−1})) Bin(y_t; i_t, ρ) 1(i_t ≥ y_t),   (49)

where α_I(x_{t−1}) = (α_I^n(x_{t−1}))_{n ∈ [1:N]}, and sample from

p_θ(x_t, i_t, i_t^{1:N} | x_{t−1}, y_t) = p_θ(i_t | x_{t−1}, y_t) p_θ(i_t^{1:N} | x_{t−1}, i_t) p_θ(x_t | x_{t−1}, i_t^{1:N}) = [PoiBin(i_t; α_I(x_{t−1})) Bin(y_t; i_t, ρ) 1(i_t ≥ y_t) / p_θ(y_t | x_{t−1})] CondBer(i_t^{1:N}; α_I(x_{t−1}), i_t) Π_{n=1}^N Cat(x_t^n; [0:2], ν(x_{t−1}, i_t^n)),   (50)

with ν(x_{t−1}, i_t^n) = ((1 − i_t^n) S^n(x_{t−1}), i_t^n, (1 − i_t^n){I^n(x_{t−1}) + R^n(x_{t−1})}), which admits p_θ(x_t | x_{t−1}, y_t) as its marginal transition.

By augmenting with the infection statuses of the agents I^{1:N}(x_t) = (I^n(x_t))_{n ∈ [1:N]}, the first two steps in (48) and (50) are analogous to (25) and (27) for the SIS model, and the last step identifies an agent's state given her augmented current infection status and previous state. We point out that ν(i_0^n), ν(x_{t−1}, i_t^n) ∈ {(1,0,0), (0,1,0), (0,0,1)}.
These expressions for ν(i^n_0) and ν(x_{t−1}, i^n_t) exploit the following facts: (i) susceptible agents either remain susceptible or become infected; (ii) infected agents either remain infected or recover; (iii) agents who have recovered enjoy immunity. Algorithm 4 provides an algorithmic description of the resulting APF, which has the same cost as the APF for the SIS model.

We now consider approximation of the BIF (30) to construct a proposal distribution approximating (29). At the terminal time T, it suffices to compute ψ_T(i_T) = Bin(y_T; i_T, ρ) 1(i_T ≥ y_T) for all i_T ∈ [0:N] to represent the function ψ*_T(x_T). As before, the next iterate ψ*_{T−1}(x_{T−1}) requires an approximation of the conditional expectation f_θ(ψ_T | x_{T−1}) = Σ_{i_T=0}^{N} PoiBin(i_T; α_I(x_{T−1})) ψ_T(i_T). Following the arguments in (31), we approximate α(x_{T−1}) by ᾱ^n(x_{T−1}) = (ᾱ^n_S(x_{T−1}), ᾱ^n_I(x_{T−1}), ᾱ^n_R(x_{T−1})), defined as

ᾱ^n_S(x_{T−1}) = S^n(x_{T−1}) (1 − λ̄ N^{−1} I(x_{T−1})),
ᾱ^n_I(x_{T−1}) = S^n(x_{T−1}) λ̄ N^{−1} I(x_{T−1}) + I^n(x_{T−1}) (1 − γ̄),    (51)
ᾱ^n_R(x_{T−1}) = I^n(x_{T−1}) γ̄ + R^n(x_{T−1}),

Algorithm 4: Auxiliary particle filter for SIR model
Input:
Parameters θ ∈ Θ and number of particles P ∈ N
compute v^(i)_0 = PoiBin(i; α_{0,I}) Bin(y_0; i, ρ) 1(i ≥ y_0) for i ∈ [0:N]
set w_0 = Σ_{i=0}^{N} v^(i)_0
normalize V^(i)_0 = v^(i)_0 / w_0 for i ∈ [0:N]
sample I^(p)_0 ∼ Cat([0:N], V^(0:N)_0) and I^{N,(p)}_0 | I^(p)_0 ∼ CondBer(α_{0,I}, I^(p)_0) for p ∈ [1:P]
sample X^{n,(p)}_0 | I^{n,(p)}_0 ∼ Cat([0:2], ν(I^{n,(p)}_0)) for p ∈ [1:P] and n ∈ [1:N]
for t = 1, …, T and p = 1, …, P do
    compute v^(i,p)_t = PoiBin(i; α_I(X^(p)_{t−1})) Bin(y_t; i, ρ) 1(i ≥ y_t) for i ∈ [0:N]
    set w^(p)_t = Σ_{i=0}^{N} v^(i,p)_t
    normalize V^(i,p)_t = v^(i,p)_t / w^(p)_t for i ∈ [0:N]
    normalize W^(p)_t = w^(p)_t / Σ_{k=1}^{P} w^(k)_t
    sample A^(p)_{t−1} ∼ r(· | W^(1:P)_t)
    sample I^(p)_t ∼ Cat([0:N], V^{(0:N, A^(p)_{t−1})}_t) and I^{N,(p)}_t | I^(p)_t ∼ CondBer(α_I(X^{(A^(p)_{t−1})}_{t−1}), I^(p)_t)
    sample X^{n,(p)}_t | X^{(A^(p)_{t−1})}_{t−1}, I^{n,(p)}_t ∼ Cat([0:2], ν(X^{(A^(p)_{t−1})}_{t−1}, I^{n,(p)}_t)) for n ∈ [1:N]
end
Output: Marginal likelihood estimator p̂_θ(y_{0:T}) = w_0 Π_{t=1}^{T} P^{−1} Σ_{p=1}^{P} w^(p)_t, states (X^(p)_t)_{(t,p) ∈ [0:T]×[1:P]} and ancestors (A^(p)_t)_{(t,p) ∈ [0:T−1]×[1:P]}

which satisfies ᾱ^n_S(x_{T−1}) + ᾱ^n_I(x_{T−1}) + ᾱ^n_R(x_{T−1}) = 1. Writing the corresponding Markov transition as f̄_θ(x_T | x_{T−1}) = Π_{n=1}^{N} Cat(x^n_T; [0:2], ᾱ^n(x_{T−1})), we approximate the conditional expectation f_θ(ψ_T | x_{T−1}) by

f̄_θ(ψ_T | S(x_{T−1}), I(x_{T−1})) = Σ_{i_T=0}^{N} SumBin(i_T; S(x_{T−1}), λ̄ N^{−1} I(x_{T−1}), I(x_{T−1}), 1 − γ̄) ψ_T(i_T).
(52)

Although (52) is analogous to (32) for the SIS model, the number of susceptible agents S(x_{T−1}) cannot be determined by just knowing the number of infections I(x_{T−1}) in the SIR model. Therefore it is necessary to account for both S(x_{T−1}) and I(x_{T−1}) in our approximation, i.e. we compute

ψ_{T−1}(s_{T−1}, i_{T−1}) = Bin(y_{T−1}; i_{T−1}, ρ) 1(i_{T−1} ≥ y_{T−1}) f̄_θ(ψ_T | s_{T−1}, i_{T−1})    (53)

for all (s_{T−1}, i_{T−1}) ∈ [0:N] × [0:(N − s_{T−1})]. Since the number of agents in the population is fixed, it is also possible to work with the variables I(x_{T−1}) and R(x_{T−1}) instead. Subsequently, for t ∈ [0:T−2], we compute

ψ_t(s_t, i_t) = Bin(y_t; i_t, ρ) 1(i_t ≥ y_t) f̄_θ(ψ_{t+1} | s_t, i_t),  (s_t, i_t) ∈ [0:N] × [0:(N − s_t)],    (54)

where f̄_θ(ψ_{t+1} | s_t, i_t) = Σ_{s_{t+1}=0}^{N} Σ_{i_{t+1}=0}^{N − s_{t+1}} f̄_θ(s_{t+1}, i_{t+1} | s_t, i_t) ψ_{t+1}(s_{t+1}, i_{t+1}), with

f̄_θ(s_{t+1}, i_{t+1} | s_t, i_t) = Bin(s_{t+1}; s_t, 1 − λ̄ N^{−1} i_t) Bin(i_t − i_{t+1} + s_t − s_{t+1}; i_t, γ̄)    (55)

for (s_{t+1}, i_{t+1}) ∈ [0:s_t] × [(s_t − s_{t+1}) : (i_t + s_t − s_{t+1})], and zero otherwise. The above expression follows from the SIR model structure under the homogeneous probabilities (51): s_{t+1} of the s_t susceptible agents each avoid infection with probability 1 − λ̄ i_t / N, and i_t − i_{t+1} + s_t − s_{t+1} of the i_t infected agents each recover with probability γ̄. Algorithm 5 summarizes our approximation of the BIF (ψ_t)_{t ∈ [0:T]}.

Algorithm 5:
Backward information filter approximation for SIR model
Input:
Parameters θ ∈ Θ
compute ψ_T(i_T) = Bin(y_T; i_T, ρ) 1(i_T ≥ y_T) for i_T ∈ [0:N]
for t = T−1, …, 0, s_t = 0, …, N and i_t = 0, …, N − s_t do
    if t = T−1 then
        compute f̄_θ(ψ_{t+1} | s_t, i_t) = Σ_{i_{t+1}=0}^{N} SumBin(i_{t+1}; s_t, λ̄ N^{−1} i_t, i_t, 1 − γ̄) ψ_{t+1}(i_{t+1})
    else
        compute f̄_θ(ψ_{t+1} | s_t, i_t) = Σ_{s_{t+1}=0}^{N} Σ_{i_{t+1}=0}^{N − s_{t+1}} f̄_θ(s_{t+1}, i_{t+1} | s_t, i_t) ψ_{t+1}(s_{t+1}, i_{t+1})
    end
    compute ψ_t(s_t, i_t) = Bin(y_t; i_t, ρ) 1(i_t ≥ y_t) f̄_θ(ψ_{t+1} | s_t, i_t)
end
Output: Approximate BIF (ψ_t)_{t ∈ [0:T]}

We can define our proposal distribution and SMC weight functions in the same manner as (34) and (35), respectively. Expectations appearing in these SMC weights can be computed as

μ_θ(ψ_0) = Σ_{i_0=0}^{N} PoiBin(i_0; α_{0,I}) ψ_0(N − i_0, i_0),
f_θ(ψ_t | x_{t−1}) = Σ_{s_t=0}^{N} Σ_{i_t=0}^{N − s_t} f_θ(s_t, i_t | x_{t−1}) ψ_t(s_t, i_t),    (56)

for t ∈ [1 : T −
1] and f_θ(ψ_T | x_{T−1}) = Σ_{i_T=0}^{N} PoiBin(i_T; α_I(x_{T−1})) ψ_T(i_T), where

f_θ(s_t, i_t | x_{t−1}) = PoiBin(s_t; α_S(x_{t−1})) PoiBin(I(x_{t−1}) − i_t + S(x_{t−1}) − s_t; (I^n(x_{t−1}) γ_n)_{n ∈ [1:N]})    (57)

for (s_t, i_t) ∈ [0 : S(x_{t−1})] × [(S(x_{t−1}) − s_t) : (I(x_{t−1}) + S(x_{t−1}) − s_t)], and zero otherwise. The above expression should be understood as the analogue of (55) under the heterogeneous probabilities (46). Sampling from the proposals can be done in a similar manner as for the APF in Section 4.1. At the initial time, we sample from

q_0(x_0, i_0, i^N_0 | θ) = q_0(i_0 | θ) q_0(i^N_0 | i_0, θ) q_0(x_0 | i^N_0, θ)    (58)
= [PoiBin(i_0; α_{0,I}) ψ_0(N − i_0, i_0) / μ_θ(ψ_0)] CondBer(i^N_0; α_{0,I}, i_0) Π_{n=1}^{N} Cat(x^n_0; [0:2], ν(i^n_0)),

which admits q_0(x_0 | θ) as its marginal distribution. For time t ∈ [1 : T − 1], we sample from

q_t(x_t, i_t, i^N_t | x_{t−1}, θ) = q_t(i_t | x_{t−1}, θ) q_t(i^N_t | x_{t−1}, i_t, θ) q_t(x_t | x_{t−1}, i^N_t, θ)    (59)
= [Σ_{s_t=0}^{N − i_t} f_θ(s_t, i_t | x_{t−1}) ψ_t(s_t, i_t) / f_θ(ψ_t | x_{t−1})] CondBer(i^N_t; α_I(x_{t−1}), i_t) Π_{n=1}^{N} Cat(x^n_t; [0:2], ν(x_{t−1}, i^n_t)),

and at the terminal time from

q_T(x_T, i_T, i^N_T | x_{T−1}, θ) = q_T(i_T | x_{T−1}, θ) q_T(i^N_T | x_{T−1}, i_T, θ) q_T(x_T | x_{T−1}, i^N_T, θ)    (60)
= [PoiBin(i_T; α_I(x_{T−1})) ψ_T(i_T) / f_θ(ψ_T | x_{T−1})] CondBer(i^N_T; α_I(x_{T−1}), i_T) Π_{n=1}^{N} Cat(x^n_T; [0:2], ν(x_{T−1}, i^n_T)),

which admits q_t(x_t | x_{t−1}, θ) as its marginal transition for all t ∈ [1:T]. Algorithm 6 details the resulting cSMC method, which costs the same as the cSMC for the SIS model.

Algorithm 6:
Controlled sequential Monte Carlo for SIR model
Input:
Parameters θ ∈ Θ, approximate BIF (ψ_t)_{t ∈ [0:T]} and number of particles P ∈ N
compute v^(i)_0 = PoiBin(i; α_{0,I}) ψ_0(N − i, i) for i ∈ [0:N]
set E_0 = Σ_{i=0}^{N} v^(i)_0
normalize V^(i)_0 = v^(i)_0 / E_0 for i ∈ [0:N]
sample I^(p)_0 ∼ Cat([0:N], V^(0:N)_0) and I^{N,(p)}_0 | I^(p)_0 ∼ CondBer(α_{0,I}, I^(p)_0) for p ∈ [1:P]
sample X^{n,(p)}_0 | I^{n,(p)}_0 ∼ Cat([0:2], ν(I^{n,(p)}_0)) for p ∈ [1:P] and n ∈ [1:N]
compute w^(p)_0 = w_0(X^(p)_0) using (35) for p ∈ [1:P]
for t = 1, …, T and p = 1, …, P do
    normalize W^(p)_{t−1} = w^(p)_{t−1} / Σ_{k=1}^{P} w^(k)_{t−1}
    sample A^(p)_{t−1} ∼ r(· | W^(1:P)_{t−1})
    if t < T then
        compute v^(i,p)_t = Σ_{s=0}^{N−i} f_θ(s, i | X^{(A^(p)_{t−1})}_{t−1}) ψ_t(s, i) for i ∈ [0:N]
    else
        compute v^(i,p)_t = PoiBin(i; α_I(X^{(A^(p)_{t−1})}_{t−1})) ψ_t(i) for i ∈ [0:N]
    end
    set E^(p)_t = Σ_{i=0}^{N} v^(i,p)_t
    normalize V^(i,p)_t = v^(i,p)_t / E^(p)_t for i ∈ [0:N]
    sample I^(p)_t ∼ Cat([0:N], V^(0:N,p)_t) and I^{N,(p)}_t | I^(p)_t ∼ CondBer(α_I(X^{(A^(p)_{t−1})}_{t−1}), I^(p)_t)
    sample X^{n,(p)}_t | X^{(A^(p)_{t−1})}_{t−1}, I^{n,(p)}_t ∼ Cat([0:2], ν(X^{(A^(p)_{t−1})}_{t−1}, I^{n,(p)}_t)) for n ∈ [1:N]
    compute w^(p)_t = w_t(X^(p)_t) using (35)
end
Output: Marginal likelihood estimator p̂_θ(y_{0:T}) = Π_{t=0}^{T} P^{−1} Σ_{p=1}^{P} w^(p)_t, states (X^(p)_t)_{(t,p) ∈ [0:T]×[1:P]} and ancestors (A^(p)_t)_{(t,p) ∈ [0:T−1]×[1:P]}

Our proposal distribution q_θ(x_{0:T}) also satisfies the Kullback–Leibler upper bound in Proposition 3.1, with appropriate notational extensions to the state space {0, 1, 2}^N. The following result is analogous to Proposition 3.2 for the SIS model.

Proposition 4.1.
For each time t ∈ [0:T], the BIF approximation in Algorithm 5 satisfies

η*_t(log(ψ*_t/ψ_t) | θ) ≤ Σ_{k=t}^{T−1} c*_θ(ψ_{k+1}) { ‖φ*_θ‖_{L²(η*_k)} ‖Δ_{θ,S}‖_{L²(η*_k)} + ‖ξ*_θ‖_{L²(η*_k)} ‖Δ_{θ,R}‖_{L²(η*_k)} },    (61)

η_t(log(ψ_t/ψ*_t) | θ) ≤ Σ_{k=t}^{T−1} c_θ(ψ_{k+1}) { ‖φ_θ‖_{L²(η_k)} ‖Δ_{θ,S}‖_{L²(η_k)} + ‖ξ_θ‖_{L²(η_k)} ‖Δ_{θ,R}‖_{L²(η_k)} }.    (62)

The constants are c*_θ(ψ_k) = {min_{x_{k−1} ∈ {0,1,2}^N \ (0,…,0)} f_θ(ψ_k | x_{k−1})}^{−1} < ∞ and c_θ(ψ_k) = {min_{(s_{k−1}, i_{k−1}) : i_{k−1} ∈ [1:N], s_{k−1} ∈ [0:N − i_{k−1}]} f̄_θ(ψ_k | s_{k−1}, i_{k−1})}^{−1} < ∞ for k ∈ [1:T], and the functions are Δ_{θ,S}(x) = Σ_{n=1}^{N} |ᾱ^n_S(x) − α^n_S(x)|, Δ_{θ,R}(x) = Σ_{n=1}^{N} |ᾱ^n_R(x) − α^n_R(x)|,

φ*_θ(x) = { Σ_{s=0}^{S(x)} PoiBin(s; α_S(x))² / PoiBin(s; ᾱ_S(x))² }^{1/2},
ξ*_θ(x) = { Σ_{r=0}^{I(x)} PoiBin(r; (I^n(x)γ_n)_{n ∈ [1:N]})² / PoiBin(r; (I^n(x)γ̄)_{n ∈ [1:N]})² }^{1/2},    (63)
φ_θ(x) = { Σ_{s=0}^{S(x)} PoiBin(s; ᾱ_S(x))² / PoiBin(s; α_S(x))² }^{1/2},
ξ_θ(x) = { Σ_{r=0}^{I(x)} PoiBin(r; (I^n(x)γ̄)_{n ∈ [1:N]})² / PoiBin(r; (I^n(x)γ_n)_{n ∈ [1:N]})² }^{1/2},    (64)

for x ∈ {0, 1, 2}^N.

The arguments in the proof of Proposition 4.1 in Appendix B.3 are similar to those of Proposition 3.2, with some modifications tailored to the SIR model. This result shows the dependence of the BIF approximation error on our approximation of the transition probability α(x) via the terms Δ_{θ,S}(x) and Δ_{θ,R}(x). As before, we can decompose Δ_{θ,S}(x) ≤ Δ^G_θ(x) + Δ^λ_θ and Δ_{θ,R}(x) ≤ Δ^γ_θ into the elements defined in (43). One can also obtain more fine-grained approximations using clustering in the spirit of Appendix B.4.

Although agent-based models have been widely used as a simulation paradigm, statistical inference for these models has not received as much attention, due in part to the computational challenges involved.
We have focused on agent-based models of disease transmission, and presented new SMC methods that can estimate their likelihood more efficiently than the standard BPF.

Our proposed methodology can be extended in various directions. Instead of the binomial model in (3), other observation models such as a negative binomial distribution could be considered. We could also adapt the controlled SMC methodology to handle the case where observations are only available at a collection of time instances, and we can expect further relative gains in such settings. Future work could consider settings where the difference in infection counts between successive time steps is observed.

We view our contribution as a step towards inference for larger classes of agent-based models, and some of our contributions might be useful beyond the models considered here. We hope to motivate further work on alleviating the computational burden of inference in agent-based models, and that the removal of some of the computational bottlenecks might encourage further investigation of the statistical properties of these models.

Acknowledgments
This work was funded by CY Initiative of Excellence (grant "Investissements d'Avenir" ANR-16-IDEX-0008). Pierre E. Jacob gratefully acknowledges support by the National Science Foundation through grants DMS-1712872 and DMS-1844695.
References
Linda JS Allen and Amy M Burgin. Comparison of deterministic and stochastic SIS and SIR models in discrete time. Mathematical Biosciences, 163(1):1–33, 2000.
Linda JS Allen, Fred Brauer, Pauline Van den Driessche, and Jianhong Wu. Mathematical Epidemiology, volume 1945. Springer, 2008.
Christophe Andrieu and Gareth O Roberts. The pseudo-marginal approach for efficient Monte Carlo computations. The Annals of Statistics, 37(2):697–725, 2009.
Christophe Andrieu, Arnaud Doucet, and Roman Holenstein. Particle Markov chain Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(3):269–342, 2010.
Andrew D Barbour and V Čekanavičius. Total variation asymptotics for sums of independent integer random variables. The Annals of Probability, 30(2):509–545, 2002.
Andrew D Barbour and Peter Hall. On the rate of Poisson convergence. In Mathematical Proceedings of the Cambridge Philosophical Society, volume 95, pages 473–480. Cambridge University Press, 1984.
Andrew D Barbour and Aihua Xia. Poisson perturbations. ESAIM: Probability and Statistics, 3:131–150, 1999.
Richard E Barlow and Klaus D Heidtmann. Computing k-out-of-n system reliability. IEEE Transactions on Reliability, 33(4):322–323, 1984.
Yoram Bresler. Two-filter formulae for discrete-time non-linear Bayesian smoothing. International Journal of Control, 43(2):629–641, 1986.
Mark Briers, Arnaud Doucet, and Simon Maskell. Smoothing algorithms for state–space models. Annals of the Institute of Statistical Mathematics, 62(1):61, 2010.
Tom Britton. Stochastic epidemic models: A survey. Mathematical Biosciences, 225(1):24–35, 2010.
Fan Bu, Allison E Aiello, Jason Xu, and Alexander Volfovsky. Likelihood-based inference for partially observed epidemics on dynamic networks. Journal of the American Statistical Association, pages 1–17, 2020.
James Carpenter, Peter Clifford, and Paul Fearnhead. Improved particle filter for nonlinear problems. IEE Proceedings - Radar, Sonar and Navigation, 146(1):2–7, 1999.
V Cekanavicius and P Vaitkus. Centered Poisson approximation via Stein's method. Lithuanian Mathematical Journal, 41(4):319–329, 2001.
Sourav Chatterjee and Persi Diaconis. The sample size required in importance sampling. The Annals of Applied Probability, 28(2):1099–1135, 2018.
C-M Chen, Christopher C Drovandi, Jonathan M Keith, Ken Anthony, M Julian Caley, and KL Mengersen. Bayesian semi-individual based model with approximate Bayesian computation for parameters calibration: Modelling crown-of-thorns populations on the Great Barrier Reef. Ecological Modelling, 364:113–123, 2017.
Sean X Chen and Jun S Liu. Statistical applications of the Poisson-binomial and conditional Bernoulli distributions. Statistica Sinica, pages 875–892, 1997.
Xiang-Hui Chen, Arthur P Dempster, and Jun S Liu. Weighted finite population sampling to maximize entropy. Biometrika, 81(3):457–469, 1994.
Donald Lee DeAngelis and Louis J Gross. Individual-based Models and Approaches in Ecology: Populations, Communities and Ecosystems. CRC Press, 2018.
Pierre Del Moral. Feynman-Kac Formulae: Genealogical and Interacting Particle Systems with Applications. Springer-Verlag New York, 2004.
Pierre Del Moral, Ajay Jasra, Anthony Lee, Christopher Yau, and Xiaole Zhang. The alive particle filter and its use in particle Markov chain Monte Carlo. Stochastic Analysis and Applications, 33(6):943–974, 2015.
Arnaud Doucet, Nando De Freitas, and Neil Gordon. An introduction to sequential Monte Carlo methods. In Sequential Monte Carlo Methods in Practice, pages 3–14. Springer, 2001.
Akira Endo, Edwin van Leeuwen, and Marc Baguelin. Introduction to particle Markov chain Monte Carlo for disease dynamics modellers. Epidemics, 29:100363, 2019.
Joshua M Epstein. Generative Social Science: Studies in Agent-based Computational Modeling, volume 13. Princeton University Press, 2006.
CT Fan, Mervin E Muller, and Ivan Rezucha. Development of sampling plans by using sequential (item by item) selection techniques and digital computers. Journal of the American Statistical Association, 57(298):387–402, 1962.
Paul Fearnhead and Peter Clifford. On-line inference for hidden Markov models via particle filters. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 65(4):887–899, 2003.
Jonathan Fintzi, Xiang Cui, Jon Wakefield, and Vladimir N Minin. Efficient data augmentation for fitting stochastic epidemic models to prevalence data. Journal of Computational and Graphical Statistics, 26(4):918–929, 2017.
Mathieu Gerber, Nicolas Chopin, and Nick Whiteley. Negative association, ordering and convergence of resampling methods. Annals of Statistics, 47(4):2236–2260, 2019.
Zoubin Ghahramani and Michael I Jordan. Factorial hidden Markov models. In Advances in Neural Information Processing Systems, pages 472–478, 1996.
Neil J Gordon, David J Salmond, and Adrian FM Smith. Novel approach to nonlinear/non-Gaussian Bayesian state estimation. In IEE Proceedings F (Radar and Signal Processing), volume 140, pages 107–113. IET, 1993.
Jakob Grazzini, Matteo G Richiardi, and Mike Tsionas. Bayesian estimation of agent-based models. Journal of Economic Dynamics and Control, 77:26–47, 2017.
Pieralberto Guarniero, Adam M Johansen, and Anthony Lee. The iterated auxiliary particle filter. Journal of the American Statistical Association, 112(520):1636–1647, 2017.
C Marijn Hazelbag, Jonathan Dushoff, Emanuel M Dominic, Zinhle E Mthombothi, and Wim Delva. Calibration of individual-based models to epidemiological data: A systematic review. PLoS Computational Biology, 16(5):e1007893, 2020.
Jeremy Heng, Adrian N Bishop, George Deligiannidis, and Arnaud Doucet. Controlled sequential Monte Carlo. Annals of Statistics, 48(5):2904–2929, 2020a.
Jeremy Heng, Pierre E. Jacob, and Nianqiao Ju. A simple Markov chain for independent Bernoulli variables conditioned on their sum. arXiv preprint arXiv:2012.03103, 2020b.
Joseph L Hodges and Lucien Le Cam. The Poisson approximation to the Poisson binomial distribution. The Annals of Mathematical Statistics, 31(3):737–740, 1960.
Yili Hong. On computing the distribution function for the Poisson binomial distribution. Computational Statistics & Data Analysis, 59:41–51, 2013.
Mevin Hooten, Christopher Wikle, and Michael Schwob. Statistical implementations of agent-based demographic models. International Statistical Review, 88(2):441–461, 2020.
Edward L Ionides, Carles Bretó, and Aaron A King. Inference for nonlinear dynamical systems. Proceedings of the National Academy of Sciences, 103(49):18438–18443, 2006.
Adam M Johansen and Arnaud Doucet. A note on auxiliary particle filters. Statistics & Probability Letters, 78(12):1498–1504, 2008.
Nikolas Kantas, Arnaud Doucet, Sumeetpal S Singh, Jan Maciejowski, and Nicolas Chopin. On particle methods for parameter estimation in state-space models. Statistical Science, 30(3):328–351, 2015.
Mira Kattwinkel and Peter Reichert. Bayesian parameter inference for individual-based models using a particle Markov chain Monte Carlo method. Environmental Modelling & Software, 87:110–119, 2017.
Augustine Kong, Jun S Liu, and Wing Hung Wong. Sequential imputations and Bayesian missing data problems. Journal of the American Statistical Association, 89(425):278–288, 1994.
Marc Lipsitch, Ted Cohen, Ben Cooper, James M Robins, Stefan Ma, Lyn James, Gowri Gopalakrishna, Suok Kai Chew, Chorh Chuan Tan, Matthew H Samore, et al. Transmission dynamics and control of severe acute respiratory syndrome. Science, 300(5627):1966–1970, 2003.
Jun S Liu and Rong Chen. Sequential Monte Carlo methods for dynamic systems. Journal of the American Statistical Association, 93(443):1032–1044, 1998.
R. McVinish and P. K. Pollett. A central limit theorem for a discrete-time SIS model with individual variation. Journal of Applied Probability, 49(2):521–530, 2012. doi: 10.1239/jap/1339878802.
Dave Osthus, James Gattiker, Reid Priedhorsky, and Sara Y Del Valle. Dynamic Bayesian influenza forecasting in the United States with hierarchical discrepancy (with discussion). Bayesian Analysis, 14(1):261–312, 2019.
Michael K Pitt and Neil Shephard. Filtering via simulation: Auxiliary particle filters. Journal of the American Statistical Association, 94(446):590–599, 1999.
Donovan Platt. A comparison of economic agent-based model calibration methods. Journal of Economic Dynamics and Control, 113:103859, 2020.
Marcel Scharth and Robert Kohn. Particle efficient importance sampling. Journal of Econometrics, 190(1):133–147, 2016.
Jukka Sirén, Luc Lens, Laurence Cousseau, and Otso Ovaskainen. Assessing the dynamics of natural populations by fitting individual-based models with approximate Bayesian computation. Methods in Ecology and Evolution, 9(5):1286–1295, 2018.
Seth Tisue and Uri Wilensky. NetLogo: A simple environment for modeling complexity. In International Conference on Complex Systems, volume 21, pages 16–21. Boston, MA, 2004.
Arthur Turrell. Agent-based models: Understanding the economy from the bottom up. Bank of England Quarterly Bulletin, page Q4, 2016.
Elske van der Vaart, Alice SA Johnston, and Richard M Sibly. Predicting how many animals will be where: How to build, calibrate and evaluate individual-based models. Ecological Modelling, 326:113–123, 2016.
A Yu Volkova. A refinement of the central limit theorem for sums of independent random indicators. Theory of Probability & Its Applications, 40(4):791–794, 1996.
Yuan H Wang. On the number of successes in independent trials. Statistica Sinica, pages 295–312, 1993.
Nick Whiteley and Lorenzo Rimella. Inference in stochastic epidemic models via multinomial approximations. arXiv preprint arXiv:2006.13700, 2020.
A Poisson binomial distributions
A.1 Recursive definition of Poisson binomial probabilities
The following provides a derivation of the recursion (10) for the function q(i, n) = PoiBin(i; α_{n:N}) = P(Σ_{m=n}^{N} X_m = i) with X_m ∼ Ber(α_m) independently. The recursion allows the computation of Poisson binomial probabilities in O(N²) operations. Under X_n ∼ Ber(α_n) independently for n ∈ [1:N], the initial conditions are given by

q(0, n) = P(Σ_{m=n}^{N} X_m = 0) = Π_{m=n}^{N} P(X_m = 0) = Π_{m=n}^{N} (1 − α_m),  n ∈ [1:N],    (65)

in the case of no success, q(1, N) = P(X_N = 1) = α_N where the sum reduces to a single Bernoulli variable, and q(i, n) = 0 for i > N − n + 1 because a sum of N − n + 1 Bernoulli variables cannot be larger than N − n + 1; in particular q(i, N) = 0 for all i ≥
2. For i ∈ [1:N] and n ∈ [1:N−1], conditioning on X_n ∈ {0, 1} and applying the law of total probability gives (10):

q(i, n) = P(X_n = 1) P(Σ_{m=n}^{N} X_m = i | X_n = 1) + P(X_n = 0) P(Σ_{m=n}^{N} X_m = i | X_n = 0)
        = α_n P(Σ_{m=n+1}^{N} X_m = i − 1) + (1 − α_n) P(Σ_{m=n+1}^{N} X_m = i)    (66)
        = α_n q(i − 1, n + 1) + (1 − α_n) q(i, n + 1).

A.2 A thinning result
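The recursion in A.1 above also underlies sampling from the conditional Bernoulli distribution CondBer(·; α, i) used in Algorithms 4 and 6. A minimal Python sketch follows; the function names are illustrative, the sequential scheme is in the spirit of Chen et al. [1994] and Chen and Liu [1997], and the probabilities are assumed to lie strictly inside (0, 1).

```python
import random

def poibin_table(alpha):
    """q[n][i] = P(sum_{m=n}^{N-1} X_m = i) for independent X_m ~ Ber(alpha[m])
    (a 0-indexed version of the paper's q(i, n)), filled by the recursion
    q(i, n) = alpha_n q(i-1, n+1) + (1 - alpha_n) q(i, n+1).
    The Poisson binomial PMF PoiBin(.; alpha) is the row q[0]."""
    N = len(alpha)
    q = [[0.0] * (N + 1) for _ in range(N + 1)]
    q[N][0] = 1.0  # an empty sum equals zero with probability one
    for n in range(N - 1, -1, -1):
        for i in range(N - n + 1):
            q[n][i] = (1.0 - alpha[n]) * q[n + 1][i]
            if i > 0:
                q[n][i] += alpha[n] * q[n + 1][i - 1]
    return q

def cond_bernoulli(alpha, total, rng=random):
    """Sequential draw from independent Ber(alpha[n]) conditioned on their sum
    equalling `total` (the CondBer distribution), using the same q-table:
    P(X_n = 1 | remaining sum r) = alpha_n q(r-1, n+1) / q(r, n)."""
    q = poibin_table(alpha)
    x, r = [], total
    for n in range(len(alpha)):
        p1 = 0.0 if r == 0 else alpha[n] * q[n + 1][r - 1] / q[n][r]
        x.append(1 if rng.random() < p1 else 0)
        r -= x[-1]
    return x
```

By construction every draw sums exactly to `total`, since the conditional success probability becomes one whenever the remaining agents must all be successes.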
We show that under the static model X_n ∼ Ber(α_n) independently for n ∈ [1:N] and the observation model Y | X = x ∼ Bin(I(x), ρ), we have Y ∼ PoiBin(ρα) marginally. We first note that the characteristic function of I(X) = Σ_{n=1}^{N} X_n ∼ PoiBin(α) is given by

E[exp(iωI(X))] = Π_{n=1}^{N} E[exp(iωX_n)] = Π_{n=1}^{N} {α_n exp(iω) + (1 − α_n)}    (67)

for ω ∈ R. Consider the representation Y = Σ_{n=1}^{I} Z_n where Z_n ∼ Ber(ρ) independently. Note that the characteristic function of each Bernoulli random variable with success probability ρ is φ_Z(ω) = E[exp(iωZ_n)] = ρ exp(iω) + (1 − ρ). By the law of iterated expectations, the characteristic function of Y is

E[exp(iωY)] = E[E[exp(iω Σ_{n=1}^{I} Z_n) | I]] = E[φ_Z(ω)^I] = Π_{n=1}^{N} E[φ_Z(ω)^{X_n}]    (68)
= Π_{n=1}^{N} {α_n φ_Z(ω) + (1 − α_n)} = Π_{n=1}^{N} {ρα_n exp(iω) + (1 − ρα_n)}

for ω ∈ R. By comparing this characteristic function with (67), we can conclude that Y ∼ PoiBin(ρα).

A.3 Comparing two Poisson binomial distributions
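The thinning identity Y ∼ PoiBin(ρα) of A.2 can be checked numerically by brute force for a small population; the helper `poibin_pmf` below is a naive enumeration over {0,1}^N and is purely illustrative.

```python
import itertools
import math

def poibin_pmf(alpha):
    """Brute-force Poisson binomial PMF by enumerating {0,1}^N (small N only)."""
    N = len(alpha)
    pmf = [0.0] * (N + 1)
    for x in itertools.product([0, 1], repeat=N):
        p = 1.0
        for xn, an in zip(x, alpha):
            p *= an if xn else 1.0 - an
        pmf[sum(x)] += p
    return pmf

def binom_pmf(y, n, rho):
    return math.comb(n, y) * rho**y * (1.0 - rho)**(n - y) if 0 <= y <= n else 0.0

# marginalizing Bin(y; i, rho) against PoiBin(alpha) should match
# PoiBin(rho * alpha) evaluated directly
alpha, rho = [0.3, 0.8, 0.5, 0.1], 0.6
p_I = poibin_pmf(alpha)
p_Y = poibin_pmf([rho * a for a in alpha])
for y in range(len(alpha) + 1):
    lhs = sum(p_I[i] * binom_pmf(y, i, rho) for i in range(len(alpha) + 1))
    assert abs(lhs - p_Y[y]) < 1e-12
```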
The following result will be of use in Appendix B.
Lemma A.1.
Let PoiBin(α) and PoiBin(ᾱ) denote two Poisson binomial distributions with probabilities α = (α_n)_{n ∈ [1:N]} and ᾱ = (ᾱ_n)_{n ∈ [1:N]}, respectively. The ℓ²-norm between these PMFs satisfies

Σ_{i ∈ [0:N]} {PoiBin(i; ᾱ) − PoiBin(i; α)}² ≤ (Σ_{n=1}^{N} |ᾱ_n − α_n|)².    (69)

The Kullback–Leibler divergence from PoiBin(α) to PoiBin(ᾱ), defined as

KL(PoiBin(ᾱ) | PoiBin(α)) = Σ_{i ∈ [0:N]} PoiBin(i; ᾱ) log(PoiBin(i; ᾱ) / PoiBin(i; α)),

is upper bounded by

KL(PoiBin(ᾱ) | PoiBin(α)) ≤ { Σ_{i ∈ [0:N]} PoiBin(i; ᾱ)² / PoiBin(i; α)² }^{1/2} (Σ_{n=1}^{N} |ᾱ_n − α_n|).    (70)

Proof.
To bound the ℓ²-norm between two Poisson binomial PMFs, we rely on Parseval's identity

Σ_{i ∈ [0:N]} {PoiBin(i; ᾱ) − PoiBin(i; α)}² = (1/2π) ∫_{−π}^{π} |φ̄(ω; ᾱ) − φ(ω; α)|² dω,    (71)

where

φ̄(ω; ᾱ) = Π_{n=1}^{N} {ᾱ_n exp(iω) + (1 − ᾱ_n)},  φ(ω; α) = Π_{n=1}^{N} {α_n exp(iω) + (1 − α_n)}    (72)

denote the characteristic functions of PoiBin(ᾱ) and PoiBin(α), respectively (as in (67)). Noting that each term of the products in (72) has modulus at most one, by repeated applications of the triangle inequality on the decomposition

φ̄(ω; ᾱ) − φ(ω; α) = Σ_{n=1}^{N} (ᾱ_n − α_n)(exp(iω) − 1) Π_{m=1}^{n−1} {α_m exp(iω) + (1 − α_m)} Π_{p=n+1}^{N} {ᾱ_p exp(iω) + (1 − ᾱ_p)}    (73)

(with the convention Π_{n=m}^{p} = 1 for p < m), we have

{ (1/2π) ∫_{−π}^{π} |φ̄(ω; ᾱ) − φ(ω; α)|² dω }^{1/2} ≤ Σ_{n=1}^{N} { (1/2π) ∫_{−π}^{π} |(ᾱ_n − α_n)(exp(iω) − 1)|² dω }^{1/2} = Σ_{n=1}^{N} |ᾱ_n − α_n|.

Hence (69) follows by squaring both sides and applying the identity in (71). To bound the Kullback–Leibler divergence, we apply the inequality log(x) ≤ x − 1 for x > 0 and the Cauchy–Schwarz inequality:

KL(PoiBin(ᾱ) | PoiBin(α)) ≤ Σ_{i ∈ [0:N]} {PoiBin(i; ᾱ) / PoiBin(i; α)} {PoiBin(i; ᾱ) − PoiBin(i; α)}
≤ { Σ_{i ∈ [0:N]} PoiBin(i; ᾱ)² / PoiBin(i; α)² }^{1/2} { Σ_{i ∈ [0:N]} {PoiBin(i; ᾱ) − PoiBin(i; α)}² }^{1/2}.    (74)

Applying the inequality in (69) leads to (70).

B Controlled sequential Monte Carlo
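As a quick sanity check of the ℓ²-bound (69) in Lemma A.1, one can compare two small Poisson binomial PMFs by brute force (the enumeration helper is illustrative, for small N only):

```python
import itertools

def poibin_pmf(alpha):
    """Brute-force Poisson binomial PMF by enumerating {0,1}^N (small N only)."""
    N = len(alpha)
    pmf = [0.0] * (N + 1)
    for x in itertools.product([0, 1], repeat=N):
        p = 1.0
        for xn, an in zip(x, alpha):
            p *= an if xn else 1.0 - an
        pmf[sum(x)] += p
    return pmf

alpha = [0.2, 0.6, 0.4]
alpha_bar = [0.25, 0.5, 0.45]
p, q = poibin_pmf(alpha), poibin_pmf(alpha_bar)
# squared l2 distance between the two PMFs, and the right-hand side of (69)
l2_sq = sum((qi - pi) ** 2 for pi, qi in zip(p, q))
bound = sum(abs(b - a) for a, b in zip(alpha, alpha_bar)) ** 2
assert l2_sq <= bound
```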
B.1 Performance of controlled sequential Monte Carlo
Proof of Proposition 3.1.
Using the form of the smoothing distribution p_θ(x_{0:T} | y_{0:T}) in (28) and (29), and the definition of the proposal distribution q_θ(x_{0:T}) in (34), we have

log(p_θ(x_{0:T} | y_{0:T}) / q_θ(x_{0:T})) = log(μ_θ(ψ_0) / μ_θ(ψ*_0)) + Σ_{t=0}^{T} log(ψ*_t(x_t) / ψ_t(I(x_t))) + Σ_{t=1}^{T} log(f_θ(ψ_t | x_{t−1}) / f_θ(ψ*_t | x_{t−1})).    (75)

By the log-sum inequality,

log(μ_θ(ψ_0) / μ_θ(ψ*_0)) ≤ μ_θ(ψ_0)^{−1} Σ_{x_0 ∈ {0,1}^N} μ_θ(x_0) ψ_0(I(x_0)) log(ψ_0(I(x_0)) / ψ*_0(x_0)) ≤ η_0(log(ψ_0/ψ*_0) | θ)    (76)

and

log(f_θ(ψ_t | x_{t−1}) / f_θ(ψ*_t | x_{t−1})) ≤ f_θ(ψ_t | x_{t−1})^{−1} Σ_{x_t ∈ {0,1}^N} f_θ(x_t | x_{t−1}) ψ_t(I(x_t)) log(ψ_t(I(x_t)) / ψ*_t(x_t)) ≤ q_t(log(ψ_t/ψ*_t) | x_{t−1}, θ).    (77)

Using the expression in (75) and the inequalities (76)-(77), the Kullback–Leibler divergence from q_θ(x_{0:T}) to p_θ(x_{0:T} | y_{0:T}) satisfies

KL(p_θ(x_{0:T} | y_{0:T}) | q_θ(x_{0:T})) = Σ_{x_{0:T} ∈ {0,1}^{N×(T+1)}} p_θ(x_{0:T} | y_{0:T}) log(p_θ(x_{0:T} | y_{0:T}) / q_θ(x_{0:T}))    (78)
≤ η_0(log(ψ_0/ψ*_0) | θ) + Σ_{t=0}^{T} η*_t(log(ψ*_t/ψ_t) | θ) + Σ_{t=1}^{T} Σ_{x_{t−1} ∈ {0,1}^N} η*_{t−1}(x_{t−1} | θ) q_t(log(ψ_t/ψ*_t) | x_{t−1}, θ).

Equation (39) follows by employing the upper bound M_t = max_{x_t ∈ {0,1}^N} η*_t(x_t | θ) / η_t(x_t | θ) and noting that

Σ_{x_{t−1} ∈ {0,1}^N} η_{t−1}(x_{t−1} | θ) q_t(log(ψ_t/ψ*_t) | x_{t−1}, θ) = η_t(log(ψ_t/ψ*_t) | θ).    (79)

B.2 Marginal distributions of the smoothing distribution and proposal distribution
To show that the constants M_t = max_{x_t ∈ {0,1}^N} η*_t(x_t | θ) / η_t(x_t | θ) in Proposition 3.1 are finite for all t ∈ [0:T−1], it suffices to bound the ratio p_θ(x_{0:T} | y_{0:T}) / q_θ(x_{0:T}) by a constant for all x_{0:T} ∈ {0,1}^{N(T+1)}. Under the requirement (21), we have

p_θ(x_{0:T} | y_{0:T}) / q_θ(x_{0:T}) = p_θ(y_{0:T})^{−1} Π_{t=0}^{T−1} w_t(x_t),    (80)

hence we will argue that each of the above weight functions can be upper bounded by some constant. From (35), the weight functions can be rewritten as w_0(x_0) = μ_θ(ψ_0) f_θ(ψ_1 | x_0) / f̄_θ(ψ_1 | I(x_0)) and w_t(x_t) = f_θ(ψ_{t+1} | x_t) / f̄_θ(ψ_{t+1} | I(x_t)) for t ∈ [1:T−1], with the convention f_θ(ψ_{t+1} | x_t) / f̄_θ(ψ_{t+1} | I(x_t)) = 1 when x^n_t = 0 for all n ∈ [1:N] and t ∈ [0:T−1]. Since the functions (ψ_t)_{t ∈ [0:T]} are upper bounded by one, we have μ_θ(ψ_0) ≤ 1 and f_θ(ψ_t | x_{t−1}) ≤ 1 for all x_{t−1} ∈ {0,1}^N and t ∈ [1:T]. It remains to show that f̄_θ(ψ_{t+1} | I(x_t)) is lower bounded by some strictly positive constant. We notice that the conditional expectation in (32) can be lower bounded by

f̄_θ(ψ_{t+1} | I(x_t)) ≥ (λ̄ N^{−1} I(x_t))^{N − I(x_t)} (1 − γ̄)^{I(x_t)} ψ_{t+1}(N).    (81)

Define the constant c = min_{i ∈ [1:N]} (λ̄ N^{−1} i)^{N − i} (1 − γ̄)^{i} > 0. At the terminal time, we have ψ_T(N) = Bin(y_T; N, ρ) > 0. By iterating the backward recursion, we have

ψ_t(N) = Bin(y_t; N, ρ) f̄_θ(ψ_{t+1} | N) ≥ Bin(y_t; N, ρ) c ψ_{t+1}(N)    (82)

for each t ∈ [0:T−1]. Hence f̄_θ(ψ_{t+1} | I(x_t)) ≥ c^{T−t} Π_{k=t+1}^{T} Bin(y_k; N, ρ) for all t ∈ [0:T−1] and x_t ∈ {0,1}^N. Combining the above observations allows us to conclude.

B.3 Backward information filter approximation
Proof of Proposition 3.2.
We will establish (40); the proof of (41) follows using similar arguments. Define e_t = η*_t(log(ψ*_t/ψ_t) | θ) for each time t ∈ [0:T]. At the terminal time T, we have e_T = 0. For time t ∈ [0:T−1], we have the decomposition

e_t = η*_t(log(ψ*_t/ψ̃_t) | θ) + η*_t(log(ψ̃_t/ψ_t) | θ),    (83)

where

ψ̃_t(x_t) = Bin(y_t; I(x_t), ρ) 1(I(x_t) ≥ y_t) f_θ(ψ_{t+1} | x_t).    (84)

Using the log-sum inequality and the decomposition of the smoothing distribution given in (28) and (29), the first term of (83) is bounded by

η*_t(log(ψ*_t/ψ̃_t) | θ) = Σ_{x_t ∈ {0,1}^N} log(f_θ(ψ*_{t+1} | x_t) / f_θ(ψ_{t+1} | x_t)) η*_t(x_t | θ)
≤ Σ_{x_t ∈ {0,1}^N} f_θ(ψ*_{t+1} | x_t)^{−1} Σ_{x_{t+1} ∈ {0,1}^N} f_θ(x_{t+1} | x_t) ψ*_{t+1}(x_{t+1}) log(ψ*_{t+1}(x_{t+1}) / ψ_{t+1}(I(x_{t+1}))) η*_t(x_t | θ)
≤ Σ_{x_t ∈ {0,1}^N} Σ_{x_{t+1} ∈ {0,1}^N} log(ψ*_{t+1}(x_{t+1}) / ψ_{t+1}(I(x_{t+1}))) η*_t(x_t | θ) p_θ(x_{t+1} | x_t, y_{t+1:T}) = e_{t+1}.    (85)

By the log-sum inequality, the second term of (83) is bounded by

η*_t(log(ψ̃_t/ψ_t) | θ) = Σ_{x_t ∈ {0,1}^N \ (0,…,0)} log(f_θ(ψ_{t+1} | x_t) / f̄_θ(ψ_{t+1} | I(x_t))) η*_t(x_t | θ)
≤ Σ_{x_t ∈ {0,1}^N \ (0,…,0)} f_θ(ψ_{t+1} | x_t)^{−1} Σ_{i_{t+1}=0}^{N} PoiBin(i_{t+1}; α(x_t)) ψ_{t+1}(i_{t+1}) log(PoiBin(i_{t+1}; α(x_t)) / PoiBin(i_{t+1}; ᾱ(x_t))) η*_t(x_t | θ)
≤ c*_θ(ψ_{t+1}) Σ_{x_t ∈ {0,1}^N} KL(PoiBin(α(x_t)) | PoiBin(ᾱ(x_t))) η*_t(x_t | θ).    (86)

The above repeatedly uses the fact that in the case of zero infections, i.e. x^n_t = 0 for all n ∈ [1:N], we have α^n(x_t) = ᾱ^n(x_t) = 0 for all n ∈ [1:N]. The constant c*_θ(ψ_{t+1}) can be shown to be finite using the arguments in Appendix B.2. The last inequality also uses the fact that max_{i_{t+1} ∈ [0:N]} |ψ_{t+1}(i_{t+1})| ≤ 1. Applying (70) of Lemma A.1 and the Cauchy–Schwarz inequality gives

η*_t(log(ψ̃_t/ψ_t) | θ) ≤ c*_θ(ψ_{t+1}) ‖φ*_θ‖_{L²(η*_t)} ‖Δ_θ‖_{L²(η*_t)}.
(87)

By combining (85) and (87), we obtain the recursion

e_t ≤ e_{t+1} + c*_θ(ψ_{t+1}) ‖φ*_θ‖_{L²(η*_t)} ‖Δ_θ‖_{L²(η*_t)}.    (88)

The claim in (40) follows by induction.

Proof of Proposition 4.1.
The arguments are similar to those of Proposition 3.2, with some modifications that we outline below. As before, we consider the quantity e_t = η*_t(log(ψ*_t/ψ_t) | θ) for t ∈ [0:T] in (61), as (62) follows using similar arguments. By the same arguments as in (85),

η*_t(log(ψ*_t/ψ̃_t) | θ) ≤ e_{t+1}.    (89)

Instead of (86), we have

η*_t(log(ψ̃_t/ψ_t) | θ) ≤ c*_θ(ψ_{t+1}) Σ_{x_t ∈ {0,1,2}^N} KL(f_θ(s_{t+1}, i_{t+1} | x_t) | f̄_θ(s_{t+1}, i_{t+1} | S(x_t), I(x_t))) η*_t(x_t | θ).    (90)

The Kullback–Leibler divergence from f̄_θ(s_{t+1}, i_{t+1} | S(x_t), I(x_t)) to f_θ(s_{t+1}, i_{t+1} | x_t) can be decomposed as

KL(f_θ(s_{t+1}, i_{t+1} | x_t) | f̄_θ(s_{t+1}, i_{t+1} | S(x_t), I(x_t)))    (91)
= KL(f_θ(s_{t+1} | x_t) | f̄_θ(s_{t+1} | S(x_t), I(x_t))) + Σ_{s_{t+1}=0}^{N} f_θ(s_{t+1} | x_t) KL(f_θ(i_{t+1} | x_t, s_{t+1}) | f̄_θ(i_{t+1} | S(x_t), I(x_t), s_{t+1}))
≤ KL(PoiBin(α_S(x_t)) | PoiBin(ᾱ_S(x_t))) + KL(PoiBin((I^n(x_t)γ_n)_{n ∈ [1:N]}) | PoiBin((I^n(x_t)γ̄)_{n ∈ [1:N]})).

Applying (70) of Lemma A.1 gives

KL(f_θ(s_{t+1}, i_{t+1} | x_t) | f̄_θ(s_{t+1}, i_{t+1} | S(x_t), I(x_t))) ≤ φ*_θ(x_t) Δ_{θ,S}(x_t) + ξ*_θ(x_t) Δ_{θ,R}(x_t).    (92)

Hence by the Cauchy–Schwarz inequality, we have

η*_t(log(ψ̃_t/ψ_t) | θ) ≤ c*_θ(ψ_{t+1}) { ‖φ*_θ‖_{L²(η*_t)} ‖Δ_{θ,S}‖_{L²(η*_t)} + ‖ξ*_θ‖_{L²(η*_t)} ‖Δ_{θ,R}‖_{L²(η*_t)} }.    (93)

B.4 Finer-grained approximations
We now discuss finer-grained approximations of the BIF and provide algorithmic details of the resulting cSMC. Instead of simply approximating individual infection and recovery rates by their population averages, we consider a clustering of agents based on their infection and recovery rates, and approximate these rates by their within-cluster averages. More precisely, let $K \in [1:N]$ denote the desired number of clusters, $(C_k)_{k \in [1:K]}$ denote a partition of $[1:N]$ that represents our clustering, and $N_k = |C_k|$ denote the number of agents in cluster $k \in [1:K]$; we approximate $\lambda^n \approx \bar{\lambda}_k = N_k^{-1} \sum_{n \in C_k} \lambda^n$ and $\gamma^n \approx \bar{\gamma}_k = N_k^{-1} \sum_{n \in C_k} \gamma^n$. For each state of the population $x \in \{0,1\}^N$, we define the summary $I_k(x) = \sum_{n \in C_k} x^n$, which counts the number of infected agents in cluster $k \in [1:K]$. We write $x^{C_k} = (x^n)_{n \in C_k}$ and $I_K(x) = (I_k(x))_{k \in [1:K]}$. Note that $\sum_{k=1}^K I_k(x) = I(x)$. The corresponding approximation of $\alpha(x)$ is given by
\[
\bar{\alpha}^n(x) =
\begin{cases}
\bar{\lambda}_{c(n)} N^{-1} I(x), & \text{if } x^n = 0, \\
1 - \bar{\gamma}_{c(n)}, & \text{if } x^n = 1,
\end{cases} \tag{94}
\]
where $c : [1:N] \to [1:K]$ maps agent labels to cluster memberships. We write the associated Markov transition as $\bar{f}_\theta(x_t \mid x_{t-1}) = \prod_{n=1}^N \mathrm{Ber}(x_t^n; \bar{\alpha}^n(x_{t-1}))$. The approximation in (94) recovers (31) in the case of $K = 1$ cluster, in which case $I_1(x) = I(x) \in [0:N]$. Larger values of $K$ can be seen as finer-grained approximations, at the expense of the increased dimensionality of $I_K(x) \in \prod_{k=1}^K [0:N_k]$. Having $K = N$ clusters offers no dimensionality reduction, as $I_N(x) = x \in \{0,1\}^N$.

At the terminal time $T$, since we have $\psi_T^\star(x_T) = \mathrm{Bin}(y_T; I(x_T), \rho)\, \mathbb{1}(I(x_T) \geq y_T)$, it suffices to compute $\psi_T(i_T^K) = \mathrm{Bin}(y_T; \sum_{k=1}^K i_T^k, \rho)\, \mathbb{1}(\sum_{k=1}^K i_T^k \geq y_T)$ for all $i_T^K \in \prod_{k=1}^K [0:N_k]$. To approximate the backward recursion (30) for $t \in [0:T-1]$, suppose that we have an approximation $\psi_{t+1}(I_K(x_{t+1}))$ at time $t+1$.
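To make the clustering construction concrete, here is a minimal sketch (hypothetical code, not the authors' implementation; the function names `cluster_summaries` and `approx_alpha` are ours) that computes the summaries $I_k(x)$ and the cluster-averaged probabilities $\bar{\alpha}^n(x)$ of (94):

```python
def cluster_summaries(x, c, K):
    """I_k(x): number of infected agents in each cluster k = 0, ..., K-1."""
    counts = [0] * K
    for xn, k in zip(x, c):
        counts[k] += xn
    return counts

def approx_alpha(x, lam, gam, c, K):
    """Cluster-averaged approximation of the transition probabilities, cf. (94):
    a susceptible agent n is infected with prob. lam_bar_{c(n)} * I(x) / N,
    and an infected agent remains infected with prob. 1 - gam_bar_{c(n)}."""
    N = len(x)
    I = sum(x)
    # within-cluster averages of the individual rates
    lam_bar = [sum(l for l, k in zip(lam, c) if k == j) / c.count(j) for j in range(K)]
    gam_bar = [sum(g for g, k in zip(gam, c) if k == j) / c.count(j) for j in range(K)]
    return [lam_bar[k] * I / N if xn == 0 else 1.0 - gam_bar[k]
            for xn, k in zip(x, c)]
```

With $K = 1$ this reduces to the population-average approximation (31), and with $K = N$ it returns the exact individual rates.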
By plugging in the approximation $\psi_{t+1}(I_K(x_{t+1})) \approx \psi_{t+1}^\star(x_{t+1})$, we consider the iterate
\[
\mathrm{Bin}(y_t; I(x_t), \rho)\, \mathbb{1}(I(x_t) \geq y_t)\, f_\theta(\psi_{t+1} \mid x_t) \tag{95}
\]
to form an approximation of $\psi_t^\star(x_t)$. We approximate the conditional expectation $f_\theta(\psi_{t+1} \mid x_t)$ by
\[
\bar{f}_\theta(\psi_{t+1} \mid I_K(x_t)) = \sum_{i_{t+1}^K \in \prod_{k=1}^K [0:N_k]} \prod_{k=1}^K \mathrm{SumBin}\big(i_{t+1}^k;\, N_k - I_k(x_t),\, \bar{\lambda}_k N^{-1} I(x_t),\, I_k(x_t),\, 1 - \bar{\gamma}_k\big)\, \psi_{t+1}(i_{t+1}^K). \tag{96}
\]
The resulting approximation of $\psi_t^\star(x_t)$ is computed using
\[
\psi_t(i_t^K) = \mathrm{Bin}\Big(y_t; \sum_{k=1}^K i_t^k, \rho\Big)\, \mathbb{1}\Big(\sum_{k=1}^K i_t^k \geq y_t\Big)\, \bar{f}_\theta(\psi_{t+1} \mid i_t^K) \tag{97}
\]
for all $i_t^K \in \prod_{k=1}^K [0:N_k]$, at $O\big(\prod_{k=1}^K (1+N_k)^2\big)$ cost per time step. Hence our overall approximation of the BIF $(\psi_t)_{t \in [0:T]}$ costs $O(\prod_{k=1}^K (1+N_k)^2\, T)$ to compute and $O(\prod_{k=1}^K (1+N_k)\, T)$ in storage. It is straightforward to extend Proposition 3.2 to characterize the error of (97); we omit this for the sake of brevity.

The corresponding approximation of the optimal proposal (29) is
\[
q_0(x_0 \mid \theta) = \frac{\mu_\theta(x_0)\, \psi_0(I_K(x_0))}{\mu_\theta(\psi_0)}, \qquad q_t(x_t \mid x_{t-1}, \theta) = \frac{f_\theta(x_t \mid x_{t-1})\, \psi_t(I_K(x_t))}{f_\theta(\psi_t \mid x_{t-1})}, \quad t \in [1:T]. \tag{98}
\]
Analogous to (35), the appropriate SMC weight functions are
\[
w_0(x_0) = \frac{\mu_\theta(\psi_0)\, g_\theta(y_0 \mid x_0)\, f_\theta(\psi_1 \mid x_0)}{\psi_0(I_K(x_0))}, \qquad w_t(x_t) = \frac{g_\theta(y_t \mid x_t)\, f_\theta(\psi_{t+1} \mid x_t)}{\psi_t(I_K(x_t))}, \quad t \in [1:T-1], \tag{99}
\]
and $w_T(x_T) = 1$. The above expectations are computed as
\[
\mu_\theta(\psi_0) = \sum_{i_0^K \in \prod_{k=1}^K [0:N_k]} \prod_{k=1}^K \mathrm{PoiBin}(i_0^k; \alpha_0^{C_k})\, \psi_0(i_0^K), \tag{100}
\]
\[
f_\theta(\psi_t \mid x_{t-1}) = \sum_{i_t^K \in \prod_{k=1}^K [0:N_k]} \prod_{k=1}^K \mathrm{PoiBin}(i_t^k; \alpha^{C_k}(x_{t-1}))\, \psi_t(i_t^K), \quad t \in [1:T], \tag{101}
\]
where $\alpha_0^{C_k} = (\alpha_0^n)_{n \in C_k}$ and $\alpha^{C_k}(x_{t-1}) = (\alpha^n(x_{t-1}))_{n \in C_k}$.
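As an illustration, the backward step (96)–(97) in the simplest case of $K = 1$ cluster can be sketched as follows (hypothetical code, not the authors' implementation; `lam_bar` and `gam_bar` denote the population-averaged rates, and the SumBin PMF is computed by direct convolution of two binomial PMFs):

```python
import math

def binom_pmf(k, n, p):
    """Binomial PMF Bin(k; n, p), zero outside {0, ..., n}."""
    if k < 0 or k > n:
        return 0.0
    return math.comb(n, k) * p**k * (1.0 - p)**(n - k)

def sumbin_pmf(n1, p1, n2, p2):
    """PMF of Bin(n1, p1) + Bin(n2, p2) over {0, ..., n1 + n2}, by convolution."""
    return [sum(binom_pmf(j, n1, p1) * binom_pmf(i - j, n2, p2)
                for j in range(i + 1))
            for i in range(n1 + n2 + 1)]

def bif_backward_step(psi_next, y_t, N, lam_bar, gam_bar, rho):
    """One backward step of the approximate BIF, cf. (96)-(97), with K = 1 cluster.
    psi_next is the table (psi_{t+1}(i))_{i in [0:N]}; returns (psi_t(i))_{i in [0:N]}.
    Given I(x_t) = i infected agents, the next count is a SumBin: N - i susceptible
    agents each infected w.p. lam_bar * i / N, plus i infected agents each
    remaining infected w.p. 1 - gam_bar."""
    psi_t = [0.0] * (N + 1)
    for i in range(N + 1):
        if i < y_t:
            continue  # the indicator 1(I(x_t) >= y_t) in (97)
        trans = sumbin_pmf(N - i, lam_bar * i / N, i, 1.0 - gam_bar)
        psi_t[i] = binom_pmf(y_t, i, rho) * sum(a * b for a, b in zip(trans, psi_next))
    return psi_t
```

Each call costs $O(N^2)$, consistent with the per-step cost $O(\prod_{k=1}^K (1+N_k)^2)$ stated above when $K = 1$.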
Sampling from the proposals (98) involves initializing from
\[
q_0(x_0, i_0^K \mid \theta) = q_0(i_0^K \mid \theta) \prod_{k=1}^K q_0(x_0^{C_k} \mid i_0^k, \theta) = \frac{\prod_{k=1}^K \mathrm{PoiBin}(i_0^k; \alpha_0^{C_k})\, \psi_0(i_0^K)}{\mu_\theta(\psi_0)} \prod_{k=1}^K \mathrm{CondBer}(x_0^{C_k}; \alpha_0^{C_k}, i_0^k), \tag{102}
\]
which admits $q_0(x_0 \mid \theta)$ as its marginal distribution, and, for time $t \in [1:T]$, sampling from
\[
\begin{aligned}
q_t(x_t, i_t^K \mid x_{t-1}, \theta) &= q_t(i_t^K \mid x_{t-1}, \theta) \prod_{k=1}^K q_t(x_t^{C_k} \mid x_{t-1}, i_t^k, \theta) \\
&= \frac{\prod_{k=1}^K \mathrm{PoiBin}(i_t^k; \alpha^{C_k}(x_{t-1}))\, \psi_t(i_t^K)}{f_\theta(\psi_t \mid x_{t-1})} \prod_{k=1}^K \mathrm{CondBer}(x_t^{C_k}; \alpha^{C_k}(x_{t-1}), i_t^k),
\end{aligned} \tag{103}
\]
which admits $q_t(x_t \mid x_{t-1}, \theta)$ as its marginal transition. The overall cost of implementing cSMC is
\[
O\Big(\Big(\sum_{k=1}^K N_k^2 + \prod_{k=1}^K N_k\Big)\, T P\Big), \tag{104}
\]
or
\[
O\Big(\Big(\sum_{k=1}^K N_k \log(N_k) + \prod_{k=1}^K N_k\Big)\, T P\Big) \tag{105}
\]
if one employs the translated Poisson approximations (12) and the MCMC in Heng et al. [2020b] to sample from conditioned Bernoulli distributions. As alluded to earlier, the choice of the number of clusters $K$ allows one to trade off the quality of our proposal against computational complexity.

C Posterior sampling of agent states using MCMC
We now describe various MCMC algorithms that can be used to sample from the smoothing distribution $p_\theta(x_{0:T} \mid y_{0:T})$ of the agent-based SIS model in Section 3. These Gibbs sampling strategies can be seen as alternatives to the SMC approach of Sections 3.1 and 3.2. The same ideas can also be applied to the SIR model of Section 4 with some modifications.

A simple option is a single-site update that samples the state of an agent at a specific time from the full conditional distribution $p_\theta(x_t^n \mid x_{t-1}, x_t^{-n}, x_{t+1:T}, y_{0:T})$, where $x_t^{-n} = (x_t^m)_{m \in [1:N] \setminus n}$, for $t \in [0:T]$ and $n \in [1:N]$ chosen deterministically or randomly with uniform probabilities using a systematic or random Gibbs scan. Using the conditional independence structure of the model, for $t \in [1:T-1]$ we have
\[
p_\theta(x_t^n = 1 \mid x_{t-1}, x_t^{-n}, x_{t+1:T}, y_{0:T}) \propto \alpha^n(x_{t-1})\, \mathrm{Bin}(y_t; I(\tilde{x}_t), \rho) \prod_{m \in \mathcal{N}(n)} \mathrm{Ber}(x_{t+1}^m; \alpha^m(\tilde{x}_t)), \tag{106}
\]
where $\tilde{x}_t = (x_t^m)_{m \in [1:N]}$ with $x_t^n = 1$. This expression also holds for the case $t = 0$ by replacing $\alpha^n(x_{t-1})$ with $\alpha_0^n$, and for $t = T$ by removing the last product. As each update costs $O(D(n))$, the overall cost of a Gibbs scan is $O(T \sum_{n=1}^N D(n))$.

As discussed in Ghahramani and Jordan [1996] for a related class of models, one can consider block updates that jointly sample the trajectories of a few agents. For a subset $b \subseteq [1:N]$ of size $B$, this Gibbs update involves sampling from $p_\theta(x_{0:T}^b \mid x_{0:T}^{-b}, y_{0:T})$, where $x_{0:T}^b = (x_t^n)_{t \in [0:T], n \in b}$ and $x_{0:T}^{-b} = (x_t^n)_{t \in [0:T], n \in [1:N] \setminus b}$. This can be done using the forward–backward algorithm at $O((2^{2B} + 2^B N) T)$ cost. Hence the overall cost of a systematic Gibbs scan over the entire population is $O((2^{2B} + 2^B N) T N / B)$. As the latter is computationally prohibitive for large $B$, one has to consider reasonably small block sizes.

We can also employ a swap update that leaves the full conditional distribution $p_\theta(x_t \mid x_{t-1}, x_{t+1:T}, y_{0:T})$ invariant for each $t \in [0:T]$ (with $x_{-1} = x_{T+1:T} = \varnothing$).
Let $x_t$ denote the current state of the population at time $t$. We choose $n_0$ and $n_1$ uniformly from the sets $\{i : x_t^i = 0\}$ and $\{i : x_t^i = 1\}$ respectively. The proposed state $\tilde{x}_t$ is such that $\tilde{x}_t^{n_0} = 1$, $\tilde{x}_t^{n_1} = 0$ and $\tilde{x}_t^n = x_t^n$ for $n \in [1:N] \setminus \{n_0, n_1\}$. For $t \in [1:T-1]$, the proposed state is accepted with probability
\[
\min\left(1,\; \frac{\alpha^{n_0}(x_{t-1})\, (1 - \alpha^{n_1}(x_{t-1})) \prod_{n=1}^N \mathrm{Ber}(x_{t+1}^n; \alpha^n(\tilde{x}_t))}{\alpha^{n_1}(x_{t-1})\, (1 - \alpha^{n_0}(x_{t-1})) \prod_{n=1}^N \mathrm{Ber}(x_{t+1}^n; \alpha^n(x_t))}\right). \tag{107}
\]
The case $t = 0$ can be accommodated by replacing $\alpha(x_{t-1})$ with $\alpha_0$, and $t = T$ by removing the ratio of Bernoulli PMFs. The cost of each swap update ranges between $O(1)$ and $O(N)$, depending on the structure of the network. As swap updates on their own do not change the number of infected agents, we employ them within a mixture kernel that also relies on the above-mentioned forward–backward scans.

As alternatives to PMMH, we can consider Gibbs samplers that alternate between sampling the parameters from $p(\theta \mid x_{0:T}, y_{0:T})$ using an MH algorithm, and the agent states from $p_\theta(x_{0:T} \mid y_{0:T})$ using the above MCMC algorithms. In particular, we employ an equally weighted mixture of kernels, performing either $N$ random swap updates for each time step, a systematic Gibbs scan with single-site updates ("single-site"), or block updates with $B = 5$ agents ("block5"). We compare these Gibbs samplers to PMMH on the simulated dataset of Section 3.3. We set the standard deviation of the Normal MH proposal within the Gibbs samplers to 0.08 to achieve suitable acceptance probabilities. We initialize the parameters from either the DGP or the prior distribution, and run all algorithms for the same number of iterations. Figure 7 displays the resulting approximations of the marginal posterior distributions of two parameters. Compared to the Gibbs samplers, the PMMH approximations appear more stable across runs with different initializations. These plots also suggest that the Gibbs samplers provide a poorer exploration of the tails of the posterior distribution, which may be attributed to the dependence between the agent states and the parameters in this setting.

Figure 7: Histograms illustrating approximations of the marginal posterior distributions of $\beta$ and $\rho$.
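To make the single-site update (106) and the swap update (107) above concrete, here is a minimal self-contained sketch (hypothetical code, not the authors' implementation). It assumes homogeneous mixing, so that the neighbourhood $\mathcal{N}(n)$ is the entire population and $\alpha^n(x) = \lambda^n I(x)/N$ for susceptible agents and $1 - \gamma^n$ for infected agents; the function names are ours:

```python
import math
import random

def alpha(x, lam, gam):
    """Transition probabilities under homogeneous mixing (an assumption of this sketch):
    susceptible n infected w.p. lam[n] * I(x) / N; infected n remains w.p. 1 - gam[n]."""
    N, I = len(x), sum(x)
    return [lam[n] * I / N if x[n] == 0 else 1.0 - gam[n] for n in range(N)]

def bernoulli_pmf(x, p):
    return p if x == 1 else 1.0 - p

def single_site_prob(n, x_prev, x_t, x_next, y_t, lam, gam, rho):
    """P(x_t^n = 1 | rest), cf. (106), at an interior time 1 <= t <= T - 1."""
    w = []
    a_prev = alpha(x_prev, lam, gam)
    for v in (0, 1):
        xt = list(x_t); xt[n] = v
        I = sum(xt)
        prior = a_prev[n] if v == 1 else 1.0 - a_prev[n]
        obs = math.comb(I, y_t) * rho**y_t * (1 - rho)**(I - y_t) if I >= y_t else 0.0
        a_t = alpha(xt, lam, gam)
        lik = math.prod(bernoulli_pmf(x_next[m], a_t[m]) for m in range(len(xt)))
        w.append(prior * obs * lik)
    return w[1] / (w[0] + w[1])

def swap_update(x_prev, x_t, x_next, lam, gam, rng=random):
    """Metropolis swap at an interior time, cf. (107): exchange a uniformly chosen
    susceptible/infected pair, which leaves I(x_t) unchanged (so the observation
    term cancels in the acceptance ratio)."""
    sus = [i for i, v in enumerate(x_t) if v == 0]
    inf = [i for i, v in enumerate(x_t) if v == 1]
    if not sus or not inf:
        return x_t  # no valid swap exists
    n0, n1 = rng.choice(sus), rng.choice(inf)
    x_new = list(x_t); x_new[n0], x_new[n1] = 1, 0
    a_prev = alpha(x_prev, lam, gam)
    a_cur, a_nxt = alpha(x_t, lam, gam), alpha(x_new, lam, gam)
    num = a_prev[n0] * (1 - a_prev[n1]) * math.prod(
        bernoulli_pmf(x_next[m], a_nxt[m]) for m in range(len(x_t)))
    den = a_prev[n1] * (1 - a_prev[n0]) * math.prod(
        bernoulli_pmf(x_next[m], a_cur[m]) for m in range(len(x_t)))
    if den > 0 and rng.random() < min(1.0, num / den):
        return x_new
    return x_t
```

Note that, as stated above, the swap update preserves the number of infected agents, so on its own it would need to be mixed with the single-site or block updates to be ergodic.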