Effects of Contact Network Models on Stochastic Epidemic Simulations
Rehan Ahmad and Kevin S. Xu
EECS Department, University of Toledo, Toledo, OH 43606, USA
[email protected] , [email protected]
Abstract.
The importance of modeling the spread of epidemics through a population has led to the development of mathematical models for infectious disease propagation. A number of empirical studies have collected and analyzed data on contacts between individuals using a variety of sensors. Typically one uses such data to fit a probabilistic model of network contacts over which a disease may propagate. In this paper, we investigate the effects of different contact network models with varying levels of complexity on the outcomes of simulated epidemics using a stochastic Susceptible-Infectious-Recovered (SIR) model. We evaluate these network models on six datasets of contacts between people in a variety of settings. Our results demonstrate that the choice of network model can have a significant effect on how closely the outcomes of an epidemic simulation on a simulated network match the outcomes on the actual network constructed from the sensor data. In particular, preserving degrees of nodes appears to be much more important than preserving cluster structure for accurate epidemic simulations.
Keywords: network model, stochastic epidemic model, contact network, degree-corrected stochastic block model
The study of transmission dynamics of infectious diseases often involves simulations using stochastic epidemic models. In a compartmental stochastic epidemic model, transitions between compartments occur randomly with specified probabilities. For example, in a stochastic Susceptible-Infectious-Recovered (SIR) model [4,10], a person may transition from S to I with a certain probability upon contact with an infectious person, or a person may transition from I to R with a certain probability to simulate recovering from the disease.

The reason for the spread of infection is contact with the infectious individual. Hence, the contact network in a population is a major factor in the transmission dynamics. Collecting an actual contact network over a large population is difficult because of limitations in capturing all the contact information. This makes it necessary to represent the network with some level of abstraction, e.g. using a statistical model. A variety of statistical models for networks have been proposed [9]; such models can be used to simulate contact networks that resemble actual contact networks.
[Figure: panels (a) Susceptible, (b) Infectious, (c) Recovered; x-axis: Time; y-axis: Fraction of population; legend: Area, Model, Actual]
Fig. 1: For each of the susceptible (S), infectious (I), and recovered (R) compartments, the mean curve for simulations on the model (shown in blue) is compared to the mean curve for simulations on the actual network (shown in red). The closeness between the model and actual network is given by the sum of the shaded areas between the curves for each compartment (smaller is better).

Our aim in this paper is to evaluate different models for contact networks in order to find the best model to use to simulate contact networks that are close to an actual observed network. We do this by comparing the disease dynamics of a stochastic SIR model over the simulated networks with the disease dynamics over the actual network. One commonly used approach is to compare the epidemic size at the end of the simulation, i.e. what fraction of the population caught the disease [19,25]. A drawback of this approach is that it only considers the steady-state outcome and not the dynamics of the disease as it is spreading.

We propose to compare the dynamics at each time instant in the simulation by calculating the area between the mean SIR curves for the epidemic over the simulated and actual networks, shown in Fig. 1. A small area indicates that the dynamics of the epidemic over the simulated contact networks are close to those of the actual network. We use this approach to compare four contact network models (in increasing order of number of parameters): the Erdős-Rényi model, the degree model, the stochastic block model, and the degree-corrected stochastic block model.
Our experimental results over six different real network datasets suggest that the degree-corrected stochastic block model provides the closest approximation to the dynamics of an epidemic on the actual contact networks. Additionally, we find that preserving node degrees appears to be more important than preserving community structure for accuracy of epidemic simulations.
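The area metric proposed above can be sketched in a few lines. This is an illustrative sketch rather than the authors' implementation: the function name `area_between_sir_curves`, the `(3, T)` array layout, the unit time step, and the use of the trapezoidal rule on the absolute gap are all assumptions made for the example.

```python
import numpy as np

def area_between_sir_curves(model_curves, actual_curves):
    """Sum over the S, I, and R compartments of the area between two mean curves.

    Each argument has shape (3, T): one row per compartment holding the mean
    fraction of the population in that compartment at each of T time points.
    """
    model_curves = np.asarray(model_curves, dtype=float)
    actual_curves = np.asarray(actual_curves, dtype=float)
    # Pointwise absolute gap between the two sets of mean curves.
    gap = np.abs(model_curves - actual_curves)
    # Trapezoidal rule with a unit time step, giving one area per compartment.
    # (Assumption: this approximates the shaded area when the curves cross.)
    areas = 0.5 * (gap[:, 1:] + gap[:, :-1]).sum(axis=1)
    return float(areas.sum())
```

A value of zero means the two sets of mean curves coincide at every time point; larger values indicate that the epidemic on the simulated networks departs from the epidemic on the actual network at some stage of the spread.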
A significant amount of previous work deals with the duration [23], frequency [17], and type [6,24] of contacts in a contact network. These findings are often incorporated into simulations of epidemics over different types of contact models. The R package EpiModel [13] allows for simulation of a variety of epidemics over temporal exponential random graph models for contact networks and has been used in studies of various infectious diseases including HIV [14].

There has also been prior work simulating the spread of disease over a variety of contact network models with the goal of finding a good approximation to the actual high resolution data in terms of the epidemic size, i.e. the final number of people infected [19,25]. Such work differs from our proposed area metric, which considers the dynamics as the disease is spreading and not just the steady-state outcome. In [3], the authors use the squared differences between the I curves (fraction of infectious individuals) of an epidemic model on simulated contact networks and on an actual contact network to calibrate parameters of the epidemic model when used on simulated contact networks. Although this metric does consider the dynamics of the epidemic, our proposed metric also involves the S and R curves for a more complete evaluation of population dynamics.

Table 1: Summary statistics from datasets used in this study.

                   HYCCUPS   Friends & Family   High School   Infectious   Primary School   HOPE
Number of nodes    43        123                126           201          242              1178
Sensor type        Wi-Fi     Bluetooth          RFID          RFID         RFID             RFID
Proximity range    N/A       5 m                1–1…          …            …                …

We consider a variety of contact network datasets in this paper. Table 1 shows summary statistics for each dataset along with the sensor type. The HYCCUPS dataset was collected at the University Politehnica of Bucharest in 2012 using a background application for Android smartphones that captures a device's encounters with Wi-Fi access points [20]. The Friends & Family (F&F) dataset was collected from the members of a residential community near a major research university using Android phones loaded with an app that records many features including proximity to other Bluetooth devices [2]. The High School (HS) dataset was collected among students from 3 classes in a high school in Marseilles, France [7] using wearable sensors that capture face-to-face proximity for more than 20 seconds. The Infectious dataset was collected at a science gallery in Dublin using wearable electronic badges to sense sustained face-to-face proximity between visitors [12].
We use data for one arbitrarily selected day (April 30) on which 201 people came to visit. The Primary School (PS) dataset was collected over 232 students and 10 teachers at a primary school in Lyon, France in a similar manner to the HS dataset [8]. Lastly, the HOPE dataset was collected from the Attendee Meta-Data project at the seventh Hackers on Planet Earth (HOPE) conference [1]. We create a contact network where the attendees at each talk form a clique; that is, each person is assumed to be in contact with every other person in the same room, which is why this network is much denser.
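The clique construction used for the HOPE network can be illustrated as follows. This is a minimal sketch under stated assumptions: the input format (a dictionary from a talk identifier to the attendees of that talk) and the function name `clique_contact_edges` are hypothetical and are not taken from the dataset's actual file layout.

```python
from itertools import combinations

def clique_contact_edges(attendance):
    """Return the undirected edge set in which each talk's attendees form a clique.

    attendance: dict mapping a talk id to an iterable of attendee ids
    (hypothetical format; attendee ids must be sortable and hashable).
    """
    edges = set()
    for attendees in attendance.values():
        # Sorting gives every unordered pair one canonical (smaller, larger) form,
        # so a pair that shares several talks is still stored only once.
        for u, v in combinations(sorted(set(attendees)), 2):
            edges.add((u, v))
    return edges
```

Because every pair of people in the same room is connected, a talk with n attendees contributes up to n(n-1)/2 edges, which is consistent with the HOPE network being much denser than the sensor-based networks.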
We construct actual networks from the datasets by connecting the individuals (nodes) with an edge if they have a contact at any point of time. We evaluate the quality of a contact network model for simulations of epidemics by conducting the following steps for each dataset:

1. Simulate 5,000 epidemics over the actual network.
2. Fit contact network model to actual network.
3. Simulate 100 networks from contact network model. For each simulated network, simulate 50 epidemics over the network for 5,000 epidemics total.
4. Compare the results of the epidemic simulations over the actual network with those over the simulated networks.

These steps are repeated for each contact network model that we consider. We describe the stochastic epidemic model we use to simulate epidemics in Section 4.1 and the contact network models we use in Section 4.2. To get a fair evaluation of the dynamics of epidemics spreading over different contact network models, all of the parameters which are not related to the contact network model, e.g. probability of infection and probability of recovery, are kept constant. Our aim is to single out the effect of using a particular contact network model while simulating an epidemic.
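Steps 1 and 3 above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `simulate_sir` and `mean_sir_curves`, the dense NumPy adjacency representation, and the parameters `beta` (per-contact infection probability) and `gamma` (recovery probability) are assumptions made for the example. The simulator follows the discrete-time SIR process of Section 4.1 and additionally assumes that each infectious neighbor is an independent chance of infection.

```python
import numpy as np

def simulate_sir(adj, beta, gamma, steps, rng):
    """Run one discrete-time stochastic SIR epidemic on a 0/1 adjacency matrix.

    Returns an array of shape (3, steps + 1) with the S, I, R fractions over time.
    """
    n = adj.shape[0]
    state = np.zeros(n, dtype=int)              # 0 = S, 1 = I, 2 = R
    state[rng.integers(n)] = 1                  # one random initial infectious node
    curves = np.empty((3, steps + 1))
    curves[:, 0] = np.bincount(state, minlength=3) / n
    for t in range(1, steps + 1):
        infectious = state == 1
        # Number of infectious neighbors of every node.
        k = adj[:, infectious].sum(axis=1)
        # Assumption: each infectious contact transmits independently.
        p_infect = 1.0 - (1.0 - beta) ** k
        new_infections = (state == 0) & (rng.random(n) < p_infect)
        # Recovery only applies to nodes that were infectious at the start
        # of this step, so nobody is infected and recovered in one step.
        new_recoveries = infectious & (rng.random(n) < gamma)
        state[new_infections] = 1
        state[new_recoveries] = 2
        curves[:, t] = np.bincount(state, minlength=3) / n
    return curves

def mean_sir_curves(networks, runs_per_network, beta, gamma, steps, seed=0):
    """Average the S, I, R curves over repeated epidemics on each network."""
    rng = np.random.default_rng(seed)
    runs = [simulate_sir(adj, beta, gamma, steps, rng)
            for adj in networks
            for _ in range(runs_per_network)]
    return np.mean(runs, axis=0)
```

Under these assumptions, step 1 would correspond to calling `mean_sir_curves` on a list holding only the actual network with 5,000 runs, and step 3 to calling it on the 100 simulated networks with 50 runs each, using the same `beta`, `gamma`, and number of time steps in both calls so that only the contact network differs.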
An actual infection spread in a population experiences randomness in several factors which may aggravate or inhibit the spread. This is considered in stochastic epidemic models. The initial condition is, in general, to have a set of infectious individuals, while the rest of the population is considered susceptible. We consider a discrete-time process, where at each time step, the infectious individuals can spread the disease with some probability of infection to susceptible individuals they have been in contact with. Also, the infectious individuals can recover from the disease with some probability independent of the individuals' contacts with others. This model is known as the stochastic SIR model and is one of the standard models used in epidemiology [4,10].

We randomly choose 1 infectious individual from the population as the initial condition and simulate the epidemic over 30 time steps. We set the probability of infection for every interaction between people to be 0.…

4.2 Contact Network Models

In practice, it is extremely difficult to obtain accurate contact network data. An alternative is to simulate a contact network by using a statistical network model. We consider several such models, which we briefly describe in the following. We refer interested readers to the survey by Goldenberg et al. [9] for details.
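Each of the models described in this section forms edges independently of one another, so once a model has been fitted, simulating a contact network reduces to independent coin flips over node pairs. The following sketch illustrates this under stated assumptions: the function name `sample_network` and the representation of a fitted model as an N × N matrix of edge probabilities are choices made for the example, not details from the paper.

```python
import numpy as np

def sample_network(prob, rng):
    """Draw a symmetric 0/1 adjacency matrix with independent edges.

    prob: (N, N) symmetric array whose (i, j) entry is the probability of an
    edge between nodes i and j. The diagonal is ignored (no self-loops).
    """
    prob = np.asarray(prob, dtype=float)
    n = prob.shape[0]
    upper = np.triu_indices(n, k=1)             # each unordered pair exactly once
    adj = np.zeros((n, n), dtype=int)
    # One uniform draw per node pair; an edge forms when the draw falls below p.
    adj[upper] = rng.random(upper[0].size) < prob[upper]
    return adj + adj.T                          # mirror to make the graph undirected
```

In this sketch an entry of `prob` that is 1 or larger always produces an edge, which matters for models in which a product of degree terms is not guaranteed to stay below 1; how such values are treated in the paper is not stated.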
Erdős-Rényi (E-R) Model

In the E-R model, an edge between any two nodes is formed with probability $p$ independent of all other edges. To fit the E-R model to a network, set the single parameter, the estimated edge probability $\hat{p} = M / \binom{N}{2}$, where $N$ and $M$ denote the number of nodes and edges in the actual network, respectively. By doing so, the expected number of edges in the E-R model will be $\binom{N}{2} \hat{p} = M$, the number of edges in the actual network.

Degree Model
In several network models, including the configuration model and preferential attachment models, the edge probability depends upon the degrees of the nodes it connects [21]. We consider a model that preserves the expected rather than actual degree of each node, often referred to as the Chung-Lu model [5]. In this model, the probability of an edge between two nodes is proportional to the product of their node degrees, and all edges are formed independently. The model has $N$ parameters, the expected degrees of each node. To fit the degree model to a network, we compute the degrees of all nodes to obtain the degree vector $d$. We then set the estimated edge probabilities $\hat{p}_{ij} = \alpha d_i d_j$, where the constant $\alpha$ is chosen so that the sum of all edge probabilities (number of expected edges) is equal to the number of edges in the actual network.

Stochastic Block Model (SBM)
In the SBM [11], the network is divided into disjoint sets of individuals forming $K$ communities. The probability of edge formation between two nodes depends only upon the communities to which they belong. This model takes as input a vector of community assignments $c$ (length $N$) and a matrix of edge formation probabilities $\Phi$ (size $K \times K$), where $\phi_{ab}$ denotes the probability that a node in community $a$ forms an edge with a node in community $b$, independent of all other edges. For an undirected graph, $\Phi$ is symmetric so the SBM has $N + \binom{K+1}{2}$ parameters in total.

To estimate community assignments, we use a regularized spectral clustering algorithm [22] that is asymptotically consistent and has been demonstrated to be very accurate in practice. We select the number of communities using the eigengap heuristic [18]. Once the community assignments $\hat{c}$ are estimated, the edge probabilities can be estimated by $\hat{\phi}_{ab} = m_{ab} / n_{ab}$, where $m_{ab}$ denotes the number of edges in the block formed by the communities $a, b$ in the observed network, and $n_{ab}$ denotes the number of possible edges in the block [16].

Degree-corrected Stochastic Block Model (DC-SBM)
The DC-SBM extends the SBM by incorporating the concepts of the degree model within an SBM [16]. The parameters of the DC-SBM are the vector of community assignments $c$ (length $N$), a node-level parameter vector $\theta$ (length $N$), and a block-level parameter matrix $\Omega$ (size $K \times K$). In a DC-SBM, an edge between a node $i \in a$ (meaning node $i$ is in community $a$) and node $j \in b$ is formed with probability $\theta_i \theta_j \omega_{ab}$ independent of all other edges. $\Omega$ is symmetric, so the DC-SBM has $2N + \binom{K+1}{2}$ parameters in total.

To fit the DC-SBM to an actual network, we first estimate the community assignments in the same manner as in the SBM using regularized spectral clustering. We then estimate the remaining parameters to be $\hat{\theta}_i = d_i / \sum_{j \in a} d_j$, for node $i \in a$, and $\hat{\omega}_{ab} = m_{ab}$ [16]. Using these estimates, we arrive at the estimated edge probabilities $\hat{p}_{ij} = \hat{\theta}_i \hat{\theta}_j \hat{\omega}_{ab}$.

To evaluate the quality of a contact network model, we compare the mean SIR curves resulting from epidemic simulations on networks generated from that model to the mean SIR curves from epidemic simulations on the actual network. If the two curves are close, then the network model is providing an accurate representation of what is likely to happen on the actual network.

To measure the closeness of the two sets of mean SIR curves, we use the sum of the areas between each set of curves as shown in Fig. 1. By measuring the area between the curves rather than just the final outcome of the epidemic simulation (e.g. the fraction of recovered people after the disease dies out as in [19,25]), we capture the difference in transient dynamics (e.g. the rate at which the infection spreads) rather than just the difference in final outcomes.

The area between the SIR curves for each model over each dataset is shown in Fig. 2a.
According to this quality measure, the DC-SBM is the most accurate model on F&F, HS, and PS; the degree model is the most accurate on HYCCUPS and HOPE; and the SBM is most accurate on Infectious. However, the SBM appears to be only slightly more accurate than the E-R model overall, despite having $N + \binom{K+1}{2}$ parameters compared to the single parameter E-R model. The contact network models were most accurate on the HOPE network, which is the densest, causing the epidemics to spread rapidly.

We also compute the log-likelihood for each contact network model on each dataset, shown in Fig. 2b. To normalize across the different sized networks, we compute the log-likelihood per node pair. Since all of the log-likelihoods are less than 0, we show the negative log-likelihood (i.e. lower is better) in Fig. 2b. Unsurprisingly, the DC-SBM, with the most parameters, also has the highest log-likelihood, whereas the relative ordering of the log-likelihoods of the degree model and SBM, both with roughly the same number of parameters, varies depending on the dataset.

[Figure: panels (a) and (b); x-axis: datasets HYCCUPS, F&F, HS, Infectious, PS, HOPE; y-axis: (a) Area between SIR curves, (b) Negative Log-Likelihood; legend: E-R, Degree, SBM, DC-SBM]
Fig. 2: Comparison of (a) area between SIR curves of each model with respect to actual network for each dataset and (b) negative log-likelihood per node pair for each model (lower is better for both measures). The DC-SBM model appears to be the best model according to both quality measures, but the two measures disagree on the quality of the degree model compared to the SBM.

Table 2: Quality measures (lower is better) averaged over all datasets for each model. Best model according to each measure is shown in bold.

Quality Measure                          E-R     Degree   SBM    DC-SBM
Area between SIR curves                  1.82    0.73     1.…    …
Negative log-likelihood per node pair    0.597   0.496    0.…    …
Number of parameters                     1       319      328    647

Both the proposed area between SIR curves and the log-likelihood can be viewed as quality measures for a contact network model. A third quality measure is given by the number of parameters, which denotes the simplicity of the model. A simpler model is generally more desirable to avoid overfitting. These three quality measures for each model (averaged over all datasets) are shown in Table 2. The DC-SBM achieves the highest quality according to the area between SIR curves and the log-likelihood at the expense of having the most parameters. On the other hand, the E-R model has only a single parameter but is the worst in the other two quality metrics. Interestingly, the degree model and SBM appear to be roughly equal in terms of the number of parameters and log-likelihood, but the area between SIR curves for the two models differs significantly. This suggests that the degree model may be better than the SBM at reproducing features of contact networks that are relevant to disease propagation.
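To make the log-likelihood comparison concrete, the sketch below fits edge probabilities for three of the models following the rules in Section 4.2 and evaluates the negative log-likelihood per node pair. Several things here are assumptions rather than details from the paper: the function names, the treatment of each node pair as an independent Bernoulli trial, the clipping constant `eps`, and the community labels passed to `sbm_probs` (the paper estimates them with regularized spectral clustering, which is not reproduced here). The DC-SBM is omitted because its normalization is not spelled out fully enough in the text to reproduce with confidence.

```python
import numpy as np

def er_probs(adj):
    """E-R fit: a single probability M / (N choose 2) for every node pair."""
    n = adj.shape[0]
    m = np.triu(adj, 1).sum()                   # number of edges M
    return np.full((n, n), m / (n * (n - 1) / 2))

def degree_probs(adj):
    """Degree model fit: p_ij = alpha * d_i * d_j with M expected edges."""
    d = adj.sum(axis=1).astype(float)
    outer = np.outer(d, d)
    m = np.triu(adj, 1).sum()
    alpha = m / np.triu(outer, 1).sum()         # matches the expected edge count to M
    return alpha * outer

def sbm_probs(adj, labels):
    """SBM fit: phi_ab = m_ab / n_ab for given labels in 0..K-1 (hypothetical input)."""
    labels = np.asarray(labels)
    z = np.eye(labels.max() + 1)[labels]        # one-hot membership, shape (N, K)
    sizes = z.sum(axis=0)
    edges = z.T @ adj @ z                       # diagonal counts each edge twice
    pairs = np.outer(sizes, sizes)
    np.fill_diagonal(edges, np.diag(edges) / 2)
    np.fill_diagonal(pairs, sizes * (sizes - 1) / 2)
    phi = np.divide(edges, pairs, out=np.zeros_like(edges), where=pairs > 0)
    return phi[np.ix_(labels, labels)]

def neg_log_lik_per_pair(adj, prob, eps=1e-12):
    """Negative Bernoulli log-likelihood averaged over all unordered node pairs."""
    upper = np.triu_indices(adj.shape[0], k=1)
    a = adj[upper]
    p = np.clip(prob[upper], eps, 1 - eps)      # avoid log(0); eps is an assumption
    log_lik = a * np.log(p) + (1 - a) * np.log(1 - p)
    return float(-log_lik.mean())
```

Dividing by the number of node pairs is what allows values from networks of different sizes to be averaged, as is done for the second row of Table 2.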
The purpose of our study was to evaluate the effects of contact network models on the results of simulated epidemics over the contact network. While it is well-known and expected that more complex models for contact network topology do a better job of reproducing features of the contact network such as degree distribution and community structure, we demonstrated that, in general, they also result in more accurate epidemic simulations. That is, the results of simulating an epidemic on a more complex network model are usually closer to the results obtained when simulating the epidemic on the actual network than if we had used a simpler network model. Moreover, models that preserve node degrees are shown to produce the most accurate epidemic simulations. Unlike most prior studies such as [19,25], we measure the quality of a network model by the area between its SIR curves and the SIR curves of the actual network, which allows us to capture differences while the disease is still spreading rather than just the difference in the final outcome, i.e. how many people were infected.

Our findings suggest that the degree-corrected stochastic block model (DC-SBM) is the best choice of contact network model in epidemic simulations because it resulted in the minimum average area between SIR curves. Interestingly, using the degree model resulted in an average area between SIR curves only slightly larger than that of the DC-SBM despite having fewer than half as many parameters, as shown in Table 2. The SBM (without degree correction) also has roughly half as many parameters as the DC-SBM, but has over twice the area between SIR curves. We note that the difference between the degree model and the SBM cannot be observed using log-likelihood as the quality measure, as both models are very close in log-likelihood. This leads us to believe that preserving degree has a greater effect on accuracy of epidemic simulations than preserving community structure.
Furthermore, this finding demonstrates that one cannot evaluate the accuracy of a contact network model for epidemic simulations simply by examining goodness-of-fit on the actual contact network!

In practice, one often cannot collect high-resolution contact data on a large scale, so having accurate contact network models is crucial to provide realistic network topologies on which we can simulate epidemics. In this paper, we estimated the parameters for each contact network model using the contact network itself, which we cannot do in practice because the contact network is often unknown. As a result, one would have to estimate the model parameters from prior knowledge or partial observation of the contact network, which introduces additional error that was not studied in this paper. It would be of great interest to perform this type of sensitivity analysis to identify whether the DC-SBM and degree model are still superior even when presented with less accurate parameter estimates. Also, there is a risk of overfitting in more complex models which should be examined in a future extension of this work. Both issues could potentially be addressed by considering hierarchical Bayesian variants of network models such as the degree-generated block model [27], which add an additional generative layer to the model with a smaller set of hyperparameters.

Another limitation of this study is our consideration of static unweighted networks. Prior work [15,19,23,25] has shown that it is important to consider the time duration of contacts between people, which can be reflected as weights in the contact network, as well as the times themselves, which can be accommodated by using models of dynamic rather than static networks, such as dynamic SBMs [26]. We plan to expand this work in the future by incorporating models of weighted and dynamic networks to provide a more thorough investigation.

References
1. aestetix, Petro, C.: CRAWDAD dataset hope/amd (v. 2008-08-07). Downloaded from http://crawdad.org/hope/amd/20080807 (2008)
2. Aharony, N., Pan, W., Ip, C., Khayal, I., Pentland, A.: Social fMRI: Investigating and shaping social mechanisms in the real world. Pervasive and Mobile Computing 7(6), 643–659 (2011)
3. Bioglio, L., Génois, M., Vestergaard, C.L., Poletto, C., Barrat, A., Colizza, V.: Recalibrating disease parameters for increasing realism in modeling epidemics in closed settings. BMC Infectious Diseases 16(1), 676 (2016)
4. Britton, T.: Stochastic epidemic models: a survey. Mathematical Biosciences 225(1), 24–35 (2010)
5. Chung, F., Lu, L.: The average distances in random graphs with given expected degrees. Proceedings of the National Academy of Sciences 99(25), 15879–15882 (2002)
6. Eames, K.: Modeling disease spread through random and regular contacts in clustered populations. Theoretical Population Biology 73(1), 104–111 (2008)
7. Fournet, J., Barrat, A.: Contact patterns among high school students. PLoS ONE 9(9), e107878 (2014)
8. Gemmetto, V., Barrat, A., Cattuto, C.: Mitigation of infectious disease at school: targeted class closure vs school closure. BMC Infectious Diseases 14(1), 695 (2014)
9. Goldenberg, A., Zheng, A.X., Fienberg, S.E., Airoldi, E.M.: A survey of statistical network models. Foundations and Trends in Machine Learning 2(2), 129–233 (2010)
10. Greenwood, P., Gordillo, L.: Stochastic epidemic modeling. In: Chowell, G., Hyman, J.M., Bettencourt, L.M.A., Castillo-Chavez, C. (eds.) Mathematical and statistical estimation approaches in epidemiology, pp. 31–52. Springer, Dordrecht (2009)
11. Holland, P.W., Laskey, K.B., Leinhardt, S.: Stochastic blockmodels: First steps. Social Networks 5(2), 109–137 (1983)
12. Isella, L., Stehlé, J., Barrat, A., Cattuto, C., Pinton, J., Van den Broeck, W.: What's in a crowd? Analysis of face-to-face behavioral networks. Journal of Theoretical Biology 271(1), 166–180 (2011)
13. Jenness, S., Goodreau, S.M., Morris, M.: EpiModel: Mathematical modeling of infectious disease (2017), http://epimodel.org/