[PDF] Computation of epidemic final size distributions

Abstract

We develop a new methodology for the efficient computation of epidemic final size distributions for a broad class of Markovian models. We exploit a particular representation of the stochastic epidemic process to derive a method which is both computationally efficient and numerically stable. The algorithms we present are also physically transparent and so allow us to extend this method from the basic SIR model to a model with a phase-type infectious period and another with waning immunity. The underlying theory is applicable to many Markovian models where we wish to efficiently calculate hitting probabilities.

Full PDF

aa r X i v : . [ q - b i o . P E ] J a n Computation of epidemic ﬁnal size distributions

Andrew J. Black a, ∗ , J. V. Ross a a School of Mathematical Sciences, The University of Adelaide, Adelaide SA 5005, Australia.

Abstract

We develop a new methodology for the e ﬃ cient computation of epidemic ﬁnal size distributions for a broad class of Markovianmodels. We exploit a particular representation of the stochastic epidemic process to derive a method which is both computationallye ﬃ cient and numerically stable. The algorithms we present are also physically transparent and so allow us to extend this methodfrom the basic SIR model to a model with a phase-type infectious period and another with waning immunity. The underlying theoryis applicable to many Markovian models where we wish to e ﬃ ciently calculate hitting probabilities. Keywords:

Markov chain, degree-of-advancement, hitting probabilities, waning immunity

1. Introduction

Markov chains, in both discrete and continuous time, havefound widespread use throughout biology (Renshaw, 1993;Sontag et al., 2006; Ross, 2010, 2011; Black and McKane,2012; Hauert and Imhof, 2012). They are useful as they areamenable to analysis due to the Markovian property but stillincorporate the aspect of randomness which is vitally importantto accurate modeling of many biological processes (Renshaw,1993; Black and McKane, 2012). In epidemiology, stochasticmodels are especially important when considering smallerpopulations such as households, schools, farms and workplaces(Halloran et al., 2008; Keeling et al., 2010; Fraser et al., 2011).The use of these types of models for inference is becomingmore widespread, thus computational e ﬃ ciency, as well asaccuracy, is a primary concern (Demiris and O’Neill, 2006;Brooks et al., 2011; House et al., 2012). For many models,stochastic simulation has been the only way of calculatingquantities, but this has the drawback of requiring averagingover many realizations for accuracy. With increasing com-puting resources, numerical methods of solution, exact to agiven precision, are now an attractive proposition for these typeof models (Keeling and Ross, 2009; Jenkinson and Goutsias,2012; Black et al., 2013; Black and Ross, 2013).In epidemiology, one of the most important quantities, forboth inference and public health, is the epidemic ﬁnal size dis-tribution. The ﬁnal size is the total number of individuals whohave experienced infection over the course of the epidemic.Thus the ﬁnal size distribution gives the probability of each ofthe possible outcomes of an epidemic. Calculating the ﬁnalsize distribution is the subject of a wide body of literature: seeHouse et al. (2013) for a review. For the classic Markovian ∗ Corresponding author: School of Mathematical Sciences, The Universityof Adelaide, Adelaide SA 5005, Australia. Phone: + Email address: [email protected] (Andrew J. Black)

SIR model, Bailey’s (1953; 1975) method and in particular theimplementation due to Neuts and Li (1996) has been shown tobe superior, being both numerically stable and computationallye ﬃ cient (House et al., 2013). The algorithm can be derivedin a number of ways, but a clear physical interpretation hasbeen lacking. There is also an implicit assumption that theinfectious period is exponentially distributed which is known tobe unrealistic (Keeling and Rohani, 2008). Another procedurewhich allows for more general infectious period distributions isdue to Ball (1986). Unfortunately, this su ﬀ ers from a numberof numerical problems, even for moderate population sizes(House et al., 2013; Demiris and O’Neill, 2006).In this paper we present a new method for the computation ofﬁnal size distributions—applicable to homogeneously-mixingMarkovian models—which is both exact and numericallystable. It is also computationally e ﬃ cient, and physicallytransparent, allowing us to calculate distributions for a rangeof more complex models. Our method is based around aparticular representation of the stochastic process, know asthe degree-of-advancement (DA) representation. This hasbeen recently used in relation to continuous-time Markovchains (Sunkara, 2009; Jenkinson and Goutsias, 2012). It isbased on counting events instead of population numbers and alexicographical ordering of the state space.The basic idea behind our methodology is intuitively simple.The epidemic process can be considered as a random walk,ending in an absorbing state which then determines the ﬁnalsize. The probability of hitting that state is then just the sumof the probabilities of all possible paths of reaching that state.However, it can be di ﬃ cult to enumerate all these paths andcorrectly sum them. In fact, in his original paper, Bailey (1953)considers a suggestion from a colleague for a scheme likethis. He rejects the idea though, because ‘the summation mayleave some doubt as to whether all relevant terms have beenincluded’. We show that adopting the DA representation solves Preprint submitted to Journal of Theoretical Biology July 19, 2018 his problem and summing over all paths becomes trivial asthe jump chain of the process resembles a probability tree.Our method is an advancement on Bailey’s method—andderivatives of it—as it is more transparent and can be extendedto a range of more complex models; it allows us to calculateﬁnal size distributions for models which up until now havebeen intractable to all but simulation, such as those whichinclude waning immunity.The remainder of this paper is as follows. We begin byillustrating the fundamental idea using the SIR model. Wediscuss the methodology and algorithms in some detail becausethe models we consider later are generalizations of this. Inparticular we compute the ﬁnal size distribution for an SIRmodel with a phase-type infectious period distribution and amodel with waning immunity where the number of infectionsis potentially unbounded. Finally we give a discussion of ourresults and their other uses. In particular, we highlight theconnection to the computation of hitting probabilities (Norris,1997). Although we have derived these results by consideringmodels in epidemiology, the basic results are much moregeneral. Stochastic models with a similar structure are nowcommon tools in many areas (Black and McKane, 2012) andhence this methodology will be potentially useful in a widerange of disciplines. MATLAB code for generating all theseresults is provided in the supplementary material.

2. SIR ﬁnal size distribution

We illustrate the basic idea using the well known Marko-vian SIR model (Keeling and Rohani, 2008). The state of theprocess is X ( t ) = ( S , I ), the number of susceptible and infec-tious individuals at time t . The transitions and rates whichdeﬁne the model are given in Table 1. These transitions aregiven in terms of the population numbers, S and I , hence thisis known as the population representation. Instead of this, wework with the DA representation (van Kampen, 1992; Sunkara,2009; Jenkinson and Goutsias, 2012). This involves countingthe number of events of each type instead of the populationnumbers. We therefore deﬁne Z and Z as the number of in-fection and recovery events respectively, and hence the stateof the process at time t is Z ( t ) = ( Z , Z ). The random vari-able Z also counts the ﬁrst infection events, which we take tohave occurred from an outside source and hence sets the initialcondition for the problem. The di ﬀ erence between these rep-resentations is that population numbers can both increase anddecrease, whereas the transition counts can only ever increase.The number of susceptible, infectious and recovered individu-als are then given by, S = N − Z , I = Z − Z , R = Z , (1)respectively, where N is the total population size. Using theserelations the rates of each type of transition can be calculatedin terms of Z and Z . Transition Rate( S , I ) → ( S − , I + β S I ( S , I ) → ( S , I − γ I Table 1: Transitions and rates deﬁning the SIR model. R = N − S − I , where N is the size of the population. Z Z Figure 1: The state space of the SIR model in the DA representation. Z countsthe number of infection events and Z counts the number of recovery events.The states are numbered i = , . . . ,

10 according to their co-lexicographic order.Arrows denote possible transitions and the initial state, z = (1 , z = (0 ,

0) corresponds to all individuals being susceptible.

We index the states of the system by z i = ( z , z ), i = , . . . , Ψ , where Ψ = ( N + N + / z i are the counts of events Z and Z that have occurred in reaching the state i . The states of thesystem are ordered, such that z i ≺ z i + . This means that,( z , z ) ≺ ( z ′ , z ′ ) i ﬀ z < z ′ or z = z ′ and z ≤ z ′ . (2)This is a co-lexicographic ordering , in contrast to that used byJenkinson and Goutsias (2012), which is a lexicographic order-ing. We choose this because it allows for a simpliﬁcation oflater parts of the algorithm needed to calculate the ﬁnal sizedistribution. Figure 1 shows the DA state space for an SIRmodel with N =

3. The states are indexed linearly accordingto their co-lexicographic ordering and arrows indicate possibletransitions between states. In practice, this ordering and the lin-ear indexing of the states is most easily enumerated by using asimple nested loop system: the ﬁrst loop iterates over values of Z = , . . . , N and then a second loop, nested inside the ﬁrst,iterates over values of Z = z , . . . , N . The state index, i , is thensimply assigned by keeping count of the number of states whichhave been iterated over.This ordering means that the stochastic transition matrix (thegenerator), and hence the jump chain of the process is upper tri-angular. The jump chain can now be thought of as a probabilitytree, where the probabilities at the leaves (i.e., the absorbingstates) are found by multiplying and adding probabilities alongthe di ﬀ erent branches. Given that the system starts in state i with probability b i we can simply write down equations for p i ,2he probability of visiting (or ending in) state i , p i = b i + X j α ji p j (3)where the sum on j is over all states which lead to state i and α ji is the probability of entering state i from state j . More gen-erally, this system of equations can be written as( I − A ) p = b , (4)where A is the transpose of the jump chain matrix. As the ma-trix ( I − A ) is lower triangular, this system of equations is solvedvery e ﬃ ciently via forward substitution. This is implementedautomatically in MATLAB. Once we have solved Eq. (4) thenthe ﬁnal size distribution, u = ( u , . . . , u N ), is found by select-ing the elements of p corresponding to absorbing states, i.e.those where Z = Z . For this model we can easily write downan expression for indices of these absorbing states, u j = p j (2 N + − j ) / + , j = , . . . , N . (5)Solving Eq. (4) gives the probabilities of visiting all states inthe system, given that we start from a particular set of states.As the state space for these problems can become very largewe might not want to store the elements of A . Instead, wecan exploit the ordered structure of the state space to derive asimple recursive method for solving equation (3).Firstly we set p = b , the initial state of the system. Next, asdescribed earlier, we use a pair of nested loops to iterate throughthe states of the system in co-lexicographic order. For each statewe calculate the rates of the two possible events, infection andrecovery respectively, as a = β ( N − z )( z − z ) , a = γ ( z − z ) . (6)Note that a and a are functions of the state, but we suppressthis for clarity of the exposition. We then deﬁne the total rate as a = a + a . Remembering that the variable i indexes the statesof the system, we then update the elements of p according to p i + = p i + + p i a / a , a > , p i + N − z = p i + N − z + p i a / a , a > . (7)This reduces the overhead of the calculation because thereis no need to store the elements of the jump matrix and theirpositions. This algorithm relies on being able to calculatethe index of the states which state i feeds into e ﬃ cientlywithout the need for complex search routines or pre-calculationof any quantities. The ordering of the state space makeswhat is potentially a di ﬃ cult and ine ﬃ cient procedure quitestraightforward.If we do not wish to calculate the whole vector p , but onlythe probabilities of hitting the absorbing states (i.e., the ﬁnalsize distribution) then we can alter the algorithm to requireeven less storage. This will become important later when we consider problems with a much larger state space. Tosee the intuition behind this, ﬁrstly note that this algorithmsummarised in Eq. (7) is di ﬀ erent to forward substitution.Each step of the above algorithm calculates the probabilitiesof transitioning to the connected states given that we are instate i . These parts are then added to any existing transitionprobabilities which have already been calculated. In thisway, each p i is calculated piecewise in two steps but becauseof the ordering, once the algorithm reaches state i , all thecontributions to p i will have been added and hence p i can beused to calculate further probabilities. Conversely, forwardsubstitution calculates each p i in one step from probabilitiesalready calculated as in Eq. (3). With the algorithm above,once it has iterated over state i then p i and the reaction rates a and a are no longer required for the rest of the calculation(unless i is an absorbing state, in which case p i forms part ofthe ﬁnal size distribution). Hence memory which was usedto store the probabilities p j , j ≤ i can be reused. This pointunderlies all the other algorithms presented in this paper.The observation just noted may be used to reduce the amountof memory needed for the algorithm as follows. The implemen-tation here assumes that initially the system starts in a state with m infectious individuals. This is the most common situation,but can be extended to handle a situation where there is a distri-bution of starting states with the addition of some complexity.We ﬁrst deﬁne the vector q which will hold working variables / probabilities. The size of this vector is | q | = N + Z =

0, i.e., thebottom row of states in Figure 1. Initially the elements of q areset to zero except q m + =

1, where m is the initial number ofinfectious individuals. The algorithm then proceeds as before,iterating over the states of the system in co-lexicographic or-der. For each state we calculate the rates as in Eq.(6) and thenupdate the elements of q such that, q z + = q z + + q z + a / a , a > , q z + = q z + a / a , a > . (8)We can see that once the probability stored in element q z + has been used to calculate the two new probabilities,it is replaced with a new probability ( q z + a / a ). Oncethe algorithm has completed then q holds the probabilitiesof the ﬁnal size distribution ( u = q ). Note that for thegeneralised algorithm in the next section this is not true andthe required number of working variables is larger than theﬁnal size distribution. This version of the algorithm thusreduces the length of the vector needed to store probabili-ties from ( N + N + / N +

1, although the runningtime is approximately the same as the previous algorithm aswe still need to iterate over all the possible states of the system.As described above, this algorithm is not the most optimizedversion we can conceive of, but we have left it in this formbecause it is easier to see how this generalizes to models withmany more events. For example, we can optimize this to re-move the conditional statements, as we know where the absorb-3ng and boundary states are, but this is more di ﬃ cult for higher-dimensional models. The simplicity of the algorithm makesthis very computationally e ﬃ cient. MATLAB code for eval-uating this and the previous algorithm is given in the supple-mentary material. The implementation of Bailey’s method dueto Neuts and Li (1996) is very similar to the ﬁrst algorithm (7)for calculating the full vector p . Their method uses the popula-tion representation of the process hence it is more complicated(MATLAB code for this is presented in the supplementary ma-terial of House et al. (2013)). In contrast, the algorithm aboveis simpler, requiring less computational operations and mem-ory. There is also a clear physical interpretation as to the actionof the algorithm at each iteration, which is subtly di ﬀ erent tothat of basic forward substitution. As we will now show, it alsoallows us to e ﬃ ciently compute ﬁnal size distributions for muchmore complicated models.

3. Model with a phase-type infectious period distribution

We now consider a model with a phase-type distributionfor the infectious period. This is achieved by splitting theinfectious period up into k stages, hence these are knownas SI( k )R models. These are widely used in epidemiologyas they capture a more realistic infectious proﬁle where thetime an individual remains infectious exhibits less variationabout the mean in comparison to an exponential distribution(Keeling and Rohani, 2008; Black et al., 2009). We do notconsider models with a latent period (such as SEIR) becausethis has no e ﬀ ect on the ﬁnal size distribution (Ball, 1986); fortime dependent quantities this would not be true.If the infectious period is split into k stages then there will be k + k +

1, with theﬁrst event being the infection and events 2 through k + k γ so that the mean du-ration of infection stays ﬁxed at 1 /γ . The transitions and ratesare summarised in Table 2. The size of the state space is now, Φ ( N , k ) = ( N + k + k + N ! = N + k + N ! . (9)This is the number of ways of allocating N individuals to k + S , I , . . . , I k + , R ). The state space and transitions forthis model with N = k = Z ; these shouldthen be visualised as being stacked on top of each other.The ﬁnal size distribution can then be found in the sameway as for the SIR model, by forming the jump chain for theprocess and solving equation (4). An obvious problem withthis approach is that with an increasing number of phases, k ,or population size, N , Φ becomes very large. Thus computingand storing the matrix A becomes expensive, so we need touse a recursive algorithm as discussed in the previous section.In particular, we do not wish to retain the full vector p , butonly wish to calculate u using the least memory possible. This reduction in storage is accomplished, as in the previous section,by exploiting the structure of the state space. As can be seen inFigure 2, the slices of the state space for Z > Z = Z = Z =0

11 12 1314 1516 Z Z

11 12 1314 15 Z =1 Z Z Z Z

17 18 Z Z =2 Z Z =3

17 181920

Figure 2: Illustration of the SI(2)R state space with N = Z . The states are numbered according to their co-lexicographic order. Possibletransitions are shown by arrows with transitions between values of Z denotedby red arrows with the destination states also indicated. The algorithm to calcu-late the ﬁnal size distribution only requires space to hold Φ (3 , =

10 variables,which corresponds to the number of states with Z =

0, i.e. the bottom slice.

For the SI( k )R model, the size of the vector, q , of workingvariables we need to store to compute the ﬁnal size distributionis | q | = Φ ( N , k − Z k + =

0. For the basic SIR model ( k = Φ ( N , = N +

1, which coincides with the number of elementsin the ﬁnal size distribution. For k > q becomes more4 ransition Counter Rate ( S , I ) → ( S − , I + z β ( N − z )( z − z k + )( I j − , I j ) → ( I j − − , I j + z j j = , . . . , k k γ ( z j − − z j )( I k , R ) → ( I k − , R + z k + k γ ( z k − z k + ) Table 2: Transitions and rates deﬁning the SI( k )R model with an Erlang distribution. R = N − S − P kj = I j , where N is the size of the population. Only states thatchange in a transition are presented. complicated. The ratio of the number of states in the system tothe number of elements of q is ( k + / ( N + k + N increases.The algorithm proceeds as follows: Initialization is done bysetting the elements of q to zero, apart from q m + =

1, where m is the initial number of infectious individuals. As before,we iterate through the states of the system in co-lexicographicorder with the use of a nested loop system. Thus we ﬁrstloop over values of Z k + = , . . . , N , which sets the limitsfor a second loop, nested within the ﬁrst, which iterates over Z k = z k + , . . . , N , which sets the limits for a third loop et cetera .The state of the system is then z i = ( z , . . . , z k + ), where i is therunning total of the number of states which have been loopedover so far. In fact i is not actually needed for any calculationsin the algorithm, but is useful for its description.We now have to establish a mapping between a given stateof the system z i and where the probability of reaching that stateis stored within the vector q . This is by done by introducinganother counter ω which indexes q . For each new value of Z k + the algorithm loops over, we calculate ξ = Φ ( N , k − − Φ ( N − z k + , k − + ω = ξ . After each iteration, ω → ω +

1, exceptwhen Z k + changes, in which case it is re-calculated fromEq. (10). So for each value of Z , ω counts through the values ξ, . . . , Φ ( N , k − z i we then calculate the rates of all possibleevents, a j , j = , . . . , k +

1, and their total a = P k + j = a j andupdate the elements of q as follows. For each event type j = , . . . , k we update q l = q l + q ω a j / a , a j > , (11)in order, where l = ω + Φ ( N − z j , j − . (12)For the event of type k +

1, instead, we do q ω = q ω a k + / a , a k + > . (13)These operations are just a generalisation of those given inEq. (8), where in the last step the probability stored in q ω isreplaced by q ω a k + / a , which is the probability of a Z k + eventfrom the current state. Once the algorithm has iterated over all the states of thesystem, the elements of u can be extracted from q . Anotherway to ﬁnd u is to extract the correct probabilities as thealgorithm progresses: if it comes to a state which has zeroprobability of leaving (i.e., when Z k + = Z ), then it is anabsorbing state corresponding to a ﬁnal size, Z . Because ofthe chosen ordering, the order in which absorbing states areencountered matches that of the ﬁnal size distribution. An ex-ample of the output is shown in Figure 3 for the SI(4)R modeland compared with the standard SIR model in the inset. Fullcode for generating this is given in the supplementary materials.If k = ω , becomes redundant. This is because, ξ = Φ ( N , − Φ ( N − z , + = z + , thus for each value of Z , ω would iterate over the values z + , . . . , N +

1. As Z already iterates over z , . . . , N , then wecan simply set ω = z +

1, and remove the ω counter. Finally,the function l , given by Eq. (12), reduces to l = z + Φ ( N − z , − =

1. Thus the update rules in Eqs. (11) and (13)reduce to those in Eq. (8).The power of this recursive approach comes from thesimplicity of the algorithm, and hence the speed of execution.Obviously, this relies on being able to calculate the indexes ω and l analytically, so this information does not need to bestored. Even though the size of the state space for such modelscan be into the hundreds of millions it can be processed veryquickly and numerically it is very stable. This is in contrastto the method of Ball (1986) which, although it uses a muchsmaller set of equations and allows using any distribution forthe infectious period, requires numbers retaining a high degreeof accuracy which is harder for a computer to handle natively(Demiris and O’Neill, 2006).

4. Model with waning immunity

We now use the methodology developed in the previous sec-tions to calculate the total number of infections in a stochas-tic epidemic model with waning immunity. The simplest suchmodel is a susceptible-infected-recovered-susceptible (SIRS)model (Keeling and Rohani, 2008). The transitions deﬁningthis are as in Table 1 with the addition of( S , R ) → ( S + , R −

1) at rate µ R , (14)5 =0

11 12 13 Z =1 Z Z

10 21 22 2325 2628 Z Z

11 12 13 1415 16 1718 19 Z =2

31 32 3335 3638 Z Z

21 22 23 2425 26 2728 29 Figure 4: Illustration of the structure of state space for the SIRS model with N = Z = , ,

2. States are labelled by theirco-lexicographic order. For each value of Z the slice of state space has the same structure as the triangular state space of the SIR model in Figure 1. Possibletransitions between slices of constant Z are shown with the red arrows along with the destination state. p r obab ili t y d i ff e r en c e Figure 3: Main plot shows the probability mass function for a SI(4)R model,with N = ﬀ erence between the pmfs for the Er-lang(4) model and the basic exponential SIR model. Both epidemics start with1 infectious individual. Other parameters: β = / ( N −

1) and γ = where R = N − S − I is the number of recovered individuals.The state space in the population representation now containsloops and there is only one absorbing state corresponding to( S = N , I = , R = Z . Such a model is di ﬀ erent to the SIRmodel presented in Section 2 as there is no upper bound on thetotal number of infection events. Thus, S = N − Z + Z , I = Z − Z , R = Z − Z , (15)and the state space is unbounded.The ﬁrst part of the state space of this model, with N = Z )has the same structure as the SIR model shown in Figure 1.The red arrows indicate Z transitions between di ﬀ erent slices.Importantly, there are no loops in this state space because thecounts can only ever increase. Thus the absorbing states arewhen Z = Z = Z , which corresponds to a ﬁnal size of Z .One way to calculate the ﬁnal size distribution is to simplytruncate the DA state space, form the jump chain and solve Eq. (4). A much better way is to exploit the ordered structureof the state space to create a recursive method. By ordering thestates with respect to the number of Z events, as in Figure 4,we can use Eqs. (15) to derive limits on the possible number of Z and Z events, Z ≤ Z ≤ N + Z , and Z ≤ Z ≤ N + Z . (16)This suggests we make a transformation of variables, Z ′ = Z − Z and Z ′ = Z − Z , so the rates of events Z , Z and Z become, a = β ( N − z ′ )( z ′ − z ′ ) , a = γ ( z ′ − z ′ ) , a = µ z ′ , (17)respectively. Note, that these rates are now independent of z .To recursively compute the ﬁnal size distribution (total num-ber of Z events) we only need to store probabilities for thenumber of states in two slices of the state space. We de-note these working variables as q and h , both of length Ψ = ( N + N + /

2. We truncate the maximum number of Z events, and hence the largest possible observed ﬁnal size at Λ .We then deﬁne the vector u = ( u , . . . , u Λ+ ) which will holdthe ﬁnal size distribution. The last element of u will be proba-bility of observing more than Λ infection events, which is alsocomputed very naturally using this approach.We initialize q , h and u by setting all their elements to zero,apart from q m + =

1, where m is the initial number of infectiousindividuals. The algorithm then proceeds as follows: for eachvalue Z = , . . . , Λ , we iterate over all possible values of Z ′ and Z ′ in co-lexicographic order in exactly the same way asfor the basic SIR model. As in the previous algorithm, wemaintain a counting variable, ω , which counts the number ofstates that have been iterated over for a given value of Z . Themaximum value of ω will be Ψ as this is the number of statesin one slice of the state space.For each state we calculate the rates of the three events from6

10 20 30 40 50 6000.050.10.150.20.25 number of infections p r obab ili t y Figure 5: The probability mass function, u , for the total number of infectionsin an SIRS model with N =

30. The maximum number of infections, Λ , isset to 60. The bar at 61 is then the probability of Z >

60. Parameters are: β = / ( N − γ = µ = . Eq. (17). We then update the elements of q and h as follows: q ω + = q ω + + q ω a / a , a > , q ω + N − z ′ = q ω + N − z ′ + q ω a / a , a > , h ω − N + z ′ − = q ω a / a , a > , (18)where a = a + a + a . The ﬁrst two equations in (18) havethe same form as for the SIR model given in Eq. (7). In deter-mining the element of h to update we have taken into accountthe transformation of variables above. Once the algorithm hasiterated over all values of Z ′ and Z ′ we can update u as, u z = q . (19)This follows because, for a given value of Z , the absorbingstate is when Z ′ = Z ′ = q . The algorithm then continues by setting q = h , then h = Z . Once all values of Z have been iterated over, the probability of more than Λ infectionevents is simply u Λ+ = Ψ X j = q j . (20)This sum is over the elements of q because in the last step of thealgorithm the probabilities stored in h are moved to q . Alterna-tively, u Λ+ could be calculated as 1 − P Λ j = u j . Because of theease with which we can calculate this probability, an alternativeway to formulate the algorithm is to only terminate the recur-sion over Z once the probability of more than a given numberof infections falls below some pre-deﬁned threshold. Figure 5shows the probability mass function of the total number of in-fections for an SIRS model with N =

5. Discussion and Conclusion

We have presented a new method for computing the ﬁnalsize distribution for a number of epidemic models. Using theDA representation of the stochastic process means the jumpchain has a particularly simple form resembling a probabilitytree. This structure allows the derivation of an iterative method for calculating the ﬁnal size distribution. Although similarmethods have been proposed for the SIR model (Neuts and Li,1996), ours is both the simplest to implement and understandand is also more e ﬃ cient computationally and in terms ofmemory usage. The biggest advantage of our method is itsstraightforward applicability to more complex models. Wehave demonstrated this by computing ﬁnal size distributionsfor the SI( k )R model and the SIRS model, where the number ofinfection events is unbounded. This opens up these models foruse in inference work using ﬁnal size data (House et al., 2013).Although we have derived and illustrated these methods us-ing models within epidemiology, they are much more general.They are applicable to Markovian models in which we wishto calculate hitting probabilities (MacNamara and Burrage,2010) or where probabilities of given paths through a statespace are required (Nowak and Chou, 2009; Williams et al.,2013). In ecology and epidemiology we are often interested ino ﬀ spring distributions. This is the distribution of the number ofsecondary entities a single entity gives rise to over a particularlifetime. Ross (2011) has shown how o ﬀ spring distributionsfor a stochastic model can be computed by recursively solvingsets of linear equations. With the methodology presented here,these quantities could be computed more e ﬃ ciently in muchthe same way as presented for the SIRS model.A limitation on this method is that as more event types areadded the size of the state space grows as described by Eq. (9).Thus although typically we are not limited by memory due tothe recursive nature of the solutions, the time to compute adistribution will grown linearly with the size of the state space.As an example of times for computation, the SI(4)R modelshown in Figure 3 has N =

100 and Φ ≈ . × ; computationof the ﬁnal size distribution is approximately 3.1 seconds usingan algorithm programmed in C on a 2.5GHz machine. Thesame model with N =

200 has Φ ≈ × ; this took 1.8minutes to process. By normal standards, these are very largestate spaces. One advantage of the Ball method is that it canhandle any infectious period distribution for which we canwrite down the Laplace transform (Ball, 1986), whereas ourmethod is limited to phase-type distributions.For very large values of k , the number of stages in the infec-tious period, Ball’s method combined with a variable precisionarithmetic (Demiris and O’Neill, 2006) might outperform ourmethod, depending on the implementation of the variable pre-cision arithmetic. However, for values of k typically of interest,our method is the most e ﬃ cient. Another important class ofepidemic models are those with a structured population such ashousehold or network models (Ball et al., 1997; Danon et al.,2011). These have such large state spaces, even for modestpopulation sizes, that our method would become impractical.For such models a number of good techniques have beendeveloped (O’Neill, 2009; Neal, 2012).Fundamentally the algorithm we have presented is a serialprocess, so there is little possibility of using parallel compu-7ational resources to speed up the procedure. One promisingdirection which could allow even larger models is the use ofan adaptive mechanism which limits the algorithm to the partsof the state space where most probability is concentrated. Theordering of the state space lends itself to a scheme like thisso, especially as the dimensionality of the problem increases,this could provide large savings with little loss of accuracy.For larger values of N there is also the possibility of derivingapproximate analytic solutions from asymptotic results.There are two other areas where this methodology ispotentially important. The DA representation was originallyconceived as a way of speeding up the integration of theforward equation which describes the temporal dynamicsof the system (Jenkinson and Goutsias, 2012). This methodbecomes slow as the size of the state space increases, thusit would be very beneﬁcial to be able to restrict the statespace to a smaller region as in ﬁnite-state projection methods(Munsky and Khammash, 2006). Our methodology providesa way to calculate which regions of the state space should bekept by solving for the vector p which provides informationabout the most probable paths through the system. Anotherpotentially important use of this methodology is in computingthe probability of hitting a ﬁxed state from a given set of initialstates (Norris, 1997). We often want to calculate these quanti-ties when working with conditioned Markov chains (Waugh,1958). This quantity can be computed in a very similar way tothe ﬁnal size distribution except that the iteration over the statesof the system proceeds backwards from the ﬁnal state. Againthe structure of the state space makes this straightforward toimplement. Acknowledgements

This research was supported under the Australian ResearchCouncil’s Discovery Project (DP110102893) and Future Fel-lowship (FT130100254) funding schemes. We would also liketo thank the two anonymous referees whose extensive com-ments and careful reading have improved this paper.

References

Bailey, N. T., 1953. The total size of a general stochastic epidemic. Biometrika40, 177–185.Bailey, N. T., 1975. The mathematical theory of epidemics. Charles Gri ﬃ n andcompany limited.Ball, F., 1986. A uniﬁed approach to the distribtuion of total size and total areaunder the trajectory of infectives in epidemic models. Adv. App. Prob. 18,289–310.Ball, F., Mollison, D., Scalia-Tomba, G., 1997. Epidemics with two levels ofmixing. Ann. App. Prob. 7 (1), 46–89.Black, A. J., House, T., Keeling, M. J., Ross, J. V., 2013. Epidemiological con-sequences of household-based antiviral prophylaxis for pandemic inﬂuenza.J. R. Soc. Interface 10, 20121019.Black, A. J., McKane, A. J., 2012. Stochastic formulation of ecological modelsand their applications. Trends Ecol. Evo. 27, 337–345.Black, A. J., McKane, A. J., Nunes, A., Parisi, A., 2009. Stochastic ﬂuctuationsin the susceptible-infectious-recovered model with distributed infectious pe-riods. Phys. Rev. E. 80, 021922. Black, A. J., Ross, J. V., 2013. Estimating a Markovian epidemic model usinghousehold serial interval data from the early phase of an epidemic. PLoSONE 8, e73420.Brooks, S., Gelman, A., Jones, G. L., Meng, X.-L. (Eds.), 2011. Handbook ofMarkov Chain Monte Carlo. CRC Press.Danon, L., House, T., Jewell, C. P., Keeling, M., Roberts, G. O., Ross, J. V.,Vernon, M. C., 2011. Networks and the epidemiology of infectious disease.Interdiscip. Perspect. Infect. Dis. 2011, 1–28.Demiris, N., O’Neill, P. D., 2006. Computation of ﬁnal outcome probabilitiesfor the generalised stochastic epidemic. Statistics and Computing 16, 309–317.Fraser, C., Cummings, D. A. T., Klinkenberg, D., Burke, D. S., Ferguson,N. M., 2011. Inﬂuenza transmission in households during the 1918 pan-demic. Am. J. Epidemiol. 174, 505–514.Halloran, M. E., Ferguson, N. M., Eubank, S., Longini, I. M., Cummings,D. T. A., Lewis, B., Xu, S., Fraser, C., Vullikanti, A., Germann, T. C.,2008. Modeling targeted layered containment of an inﬂuenza pandemic inthe United States. Proc. Natl. Acad. Sci. USA 105, 4639–4644.Hauert, C., Imhof, L. A., 2012. Evolutionary games in deme structured, ﬁnitepopulations. J. Theor. Biol. 299, 106–112.House, T., Inglis, N., Ross, J. V., Wilson, F., Suleman, S., Edeghere, O., Smith,G., Olowokure, B., Keeling, M. J., 2012. Estimation of outbreak severityand transmissibility: Inﬂuenza A(H1N1)pdm09 in households. BMC Med.10, 117.House, T., Ross, J. V., Sirl, D., 2013. How big is an outbreak likely to be?methods for epidemic ﬁnal size calculation. Proc. R. Soc. A 469, 20120436.Jenkinson, G., Goutsias, J., 2012. Numerical integration of the master equationin some models of stochastic epidemiology. PLoS ONE 7, e36160.van Kampen, N. G., 1992. Stochastic processes in physics and chemistry. Else-vier, Amsterdam.Keeling, M. J., Danon, L., Vernon, M. C., House, T. A., 2010. Individual iden-tity and movement networks for disease metapopulations. Proc. Natl. Acad.Sci. USA 107, 8866–8870.Keeling, M. J., Rohani, P., 2008. Modeling Infectious Diseases in Humans andAnimals. Princeton University Press, New Jersey.Keeling, M. J., Ross, J. V., 2009. E ﬃ cient methods for studying stochastic dis-ease dynamics. Theor. Popul. Biol. 75, 133–141.MacNamara, S., Burrage, K., 2010. Stochastic modeling of nave T cell home-ostasis for competing clonotypes via the master equation. Multiscale Model.Sim. 8 (4), 1325–1347.Munsky, B., Khammash, M., 2006. The ﬁnite state projection algorithm for thesolution of the chemical master equation. J. Chem. Phys. 124, 044104.Neal, P. J., 2012. E ﬃ cient likelihood-free Bayesian computation for householdepidemics. Stat. Comput. 22, 1239–1256.Neuts, M. F., Li, J., 1996. In: Heyde, C. C., Prohorov, Y. V., Pyke, R., Rachev,S. T. (Eds.), Athens Conference on Applied Probability and Time SeriesAnalysis. Springer, New York, Ch. An algorithmic study of S-I-R stochasticepidemic models.Norris, J. R., 1997. Markov chains. Cambridge University Press, Cambridge.Nowak, S. A., Chou, T., 2009. Mechanisms of Receptor / Coreceptor-Mediatedentry of enveloped viruses. Biophys. J. 96 (7), 26242636.O’Neill, P. D., 2009. Bayesian inference for stochastic multitype epidemics instructured populations using sample data. Biostat. 10, 779–791.Renshaw, E., 1993. Modelling Biological Populations in Space and Time. Cam-bridge University Press, Cambridge.Ross, J. V., 2010. Computationally exact methods for stochastic periodic dy-namics: Spatiotemporal dispersal and temporally forced transmission. J.Theor. Biol. 262, 14–22.Ross, J. V., 2011. Invasion of infectious diseases in ﬁnite, homogeneous popu-lations. J. Theor. Biol. 289, 83–89.Sontag, L., Lorincz, M., Luebeck, E., 2006. Dynamics, stability and inheritanceof somatic DNA methylation imprints. J. Theor. Biol. 242, 890–899.Sunkara, V., 2009. The chemical master equation with respect to reactioncounts. In: Proc. 18th World IMACS / MODSIM Congress. pp. 703–707.Waugh, W. A. O., 1958. Conditioned Markov processes. Biometrika 45, 241–249.Williams, B. P., Johnston, I. G., Covsho ﬀ , S., Hibberd, J. M., 2013. Phenotypiclandscape inference reveals multiple evolutionary paths to C4 photosynthe-sis. eLife 2, e00961., S., Hibberd, J. M., 2013. Phenotypiclandscape inference reveals multiple evolutionary paths to C4 photosynthe-sis. eLife 2, e00961.