A stochastic geospatial epidemic model and simulation using an event modulated Gillespie algorithm
Alexander Temerev, Liudmila Rozanova, Olivia Keiser, Janne Estill
AA stochastic geospatial epidemic model andsimulation using an event modulated Gillespiealgorithm
Alexander Temerev, Liudmila Rozanova
Institute of Global Health, University of Geneva
ABSTRACT
We developed a model and an instrument for stochastic simulations of spreading of COVID-19and other similar infectious diseases, that takes into account both contact network structures andgeographical distribution of population density, detailed up to a level of location of individuals.Our analysis framework includes the surrogate model (SuMo) optimization process for quickfitting of the model’s parameters to the observed epidemic curves for cases, hospitalizations anddeaths. This set of instruments (the model, the simulation code, and the optimizer) can be auseful tool for policymakers and epidemic response teams who can use it to forecast epidemicdevelopment scenarios in local environments (on the scale from towns to large countries) anddesign optimal response strategies. The simulation code also includes a geospatial visualizationsubsystem, presenting detailed views of epidemic scenarios directly on population density maps.We used the developed framework to draw predictions for COVID-19 spreading in Switzerland,on the level of individual cantons; their difference in population density distribution accounts tosignificant variety in epidemic curves and consequently the choice of optimal response strategy.
KEYWORDS
COVID-19, stochastic epidemic modeling, Gillespie algorithm, complex network, contact matri-ces, population density, epidemic simulation
INTRODUCTION
There are numerous epidemic modeling and simulation toolkits available as of 2020, in the rangefrom educational toy models to global-scale comprehensive frameworks (e.g. GLEAMviz [1],which accounts for global air transport networks, commuting patterns, temporal changes ofepidemic parameters, etc). While working with rough-scale global epidemic dynamics tends tobe somewhat less difficult than detailed local simulations, as numerous differences in epidemicspreading patterns and response options tend to be averaged away, the global models are lessuseful for policymakers as there is no global pandemic response organization that can work ontruly international scale.As epidemic response organizations do not work with unlimited resources (and in fact, availableresources can be quite limited in e.g. developing countries), optimal resource allocation is crucialfor developing the most efficient counter-epidemic strategy. It includes determining the most a r X i v : . [ q - b i o . P E ] J a n ulnerable population segments, the most probable epidemic hotspots, and the most impactfuldates and locations for introducing non-pharmaceutical interventions. Our framework employsage-based and context-based contact matrices to account for differences in social networks andcontact patterns of people within different age groups; it uses detailed population density mapsto predict (and visualize) geographical locations of epidemic hotspot clusters, and it employssurrogate modeling optimization system to adapt itself for real observations and medical statisticsavailable for the chosen simulation area, enabling operators to choose most impactful responseoptions. EPIDEMIC PROCESS DESCRIPTION
General assumptions
We use a standard representation the population as a network consists of N nodes (individuals) { i } and a set of links { e ij } , representing a contact between each nodes i and j . We assume thatthe number of individuals in the network remains stable during the simulation [2].In our case the network is weighted – all links e ij have the weights depending of the spatialdistance between each pair of nodes i and j ) and complete i.e. we assume that all nodes areconnected each others.For the description of the infectious process we construct a stochastic SEIR model on the network,so all nodes are in one of the four states: S (susceptible), E (exposed in the latent period), I (infected) or R (recovered/removed).Stochastic infection process S → E occurs as a result of contact between a susceptible ( i ∈ S )and infectious ( j ∈ I ) individual. This transition generally depends on the frequency of contactsand proportional to the infection rate β [2]. Simultaneously, two additional random processestake place in the network: the transition of individuals from the exposed to infectious state E → I and the removal process I → R .The model assumes two types of contact events, which we call as “local” and “remote” contacts andare distinguished using the distance matrix M = { m ij } . Thus, we take into account the spreadof infection in some local area/region and the movement of individuals between the regions.In a local area, we assume that each node can meet any other located in this area with the sameprobability. Thus, we introduce a cutoff r on the distance function d ( m ij ) , which is the samefor all nodes in the network. For each node i in the circle of the radius r with the center in i alldistances are assumed to be the same length: d ( m ij ) = r .Remote spreading is modelled as a separate process where each infectious individual can spon-taneously transmit the infection to a random target selected from the entire population (whichautomatically normalizes it for population density, as the target will be located with more prob-ability in densely populated areas), and flips it to the exposed state.Thus, in our model, the following parameters affect the contact probability:ransition Type Mean rate State change S → E l Local contagion β lsi ( s t − , e t + 1 , i t ) S → E d Remote contagion β dsi ( s t − , e t + 1 , i t ) E → I Spontaneous (cid:15) i ( s t , e t − , i t + 1) I → R Spontaneous γ i ( s t , e t , i t − Table 1.
Transition rates between Markov chain states
1. The node assignment into one of the age groups (in our model we use 16 age groups, thefrequency of contacts between all groups forms an asymmetric 16x16 matrix A = { a ij } ).2. The distance between contacting nodes. As only relative distances matter in our model, weuse the Euclidean distance between square locations on the population density map. Closernodes have a higher probability to participate in a contact and infection transmission.3. The population density in a particular area (the contact probability for distant contactsdirectly depends on the population density in the area of contact – this is achieved withrejection sampling of remote contacts over the entire simulated region). Model construction
We represent the stochastic
SEIR model as a continuous time 3-dimensional Markov chain X = { ( S ( t ) , E ( t ) , I ( t )) : t ≥ } that tracks the number of susceptible, exposed and infec-tious individuals at any time point. The number of removed individuals can be calculated as R ( t ) = N − S ( t ) − E ( t ) − I ( t ) .The epidemic starts from one or several infectious individuals n ; the population is assumedto consist of fully susceptible individuals; that is, the initial state of X is ( S (0) , E (0) , I (0)) =( N − n , , n ) . In each moment of time the state space of X is described by changing the stateof individuals according to the rules shown in the table 1. All other types of contacts do notchange the state of the system [3]. SIMULATION ALGORITHM
We employ Poisson processes for link activation. When an exposure event occurs at node i ∈ I ,we first randomly select a node j with the Euclidean distance d ( m ij ) less than r , and activate link e ij with the probability depending on the age-contact structure of the population: p ij = a ij /K ,where K is the normalization factor: total contact rate summed for all age groups. If j is in thesusceptible state, the disease is transmitted with the probability p ij and the susceptible nodebecomes infected (exposed).For the remote contact, we select a random node j over the entire simulated geographical re-gion with rejection sampling, so the target node is selected according to the population densitydistribution. This accounts for all non-local contacts (random encounters, travel, etc.), allowingto specify their relative proportion with a single parameter: the frequency of remote contactsrelative to local contacts.n infected node stays in the exposed state for a time interval (exponentially distributed) cal-culated from the mean rate (cid:15) , and after it goes to an infectious state, where it can infect othernodes. Transitions to the recovered state occur according to another Poisson process with theexponentially distributed time intervals calculated from the rate γ . A recovered node neitherinfects nor becomes infected by other nodes.The infection rate of a susceptible node depends on how many infected neighbours it has, but therecovery rate and the time until the transition from the exposed to infectious class is independentof the network configuration and status of neighbours.The mean time to node activation, which enables infection, is given by (cid:104) τ (cid:105) = (cid:82) ∞ τ ψ ( τ ) dτ . Themean time for moving an exposed node to infectious is /(cid:15) and infected node to recover is equalto /γ . A modified Gillespie algorithm
The continuous time evolution of stochastic processes with known transition rates (including epi-demiological applications) can be numerically simulated with Gillespie algorithms family, whichare statistically accurate. For our needs we used a particular variation: the event-modulatedGillespie algorithm [4].In the direct Gillespie algorithm the instantaneous event rates are updated for all processes,following the occurrence of each event, even if the probability density of the inter-event timesfor the process is not perturbed by an event that has occurred elsewhere. This procedure istime-consuming with large N . In the event-modulated Gillespie algorithm, we use a priorityqueue to keep all future events sorted by their occurrence time; after processing the nearestelement in the queue, the future events that cannot modify the global state (e.g. spreading theinfection to people who are already not susceptible) can be safely discarded. Their occurrencetimes are drawn from the corresponding exponential distributions governing the transition rates.Consequently, the event-modulated Poisson process is a mixture of Poisson processes of differentrates.Consider N Poisson processes with the rate of λ i , i ∈ [1 , N ] running in parallel. Denote thedensity of the event rate for the i th process by ρ i ( λ i ) . The renewal process is fully characterisedby the probability density of inter-event times, denote this function as ψ i ( τ ) for i th process.For an event-modulated Poisson process with probability density of the event rate ρ i ( λ i ) , wehave ψ i ( τ ) = (cid:90) ∞ ρ i ( λ i ) e − λ i τ dλ. A Poisson process with rate λ , i.e., ψ i ( τ ) = λ e − λ τ is generated by ρ i ( λ i ) = δ ( λ i − λ ) , where δ is the delta function.As there is a global priority queue already, it makes sense to calculate the contact network ofeach individual in-place, during the processing of each transition event.he algorithm itself is outlined in (Algorithm 1), adapted from István Z. Kiss et al. [5]. Itscore capabilities (infection spreading depending on contact matrices and geographical locationof individuals) are abstracted into the find_contact function (line 17). input : N: array of nodes with 2D coordinates (all in S state); β i , (cid:15) i , γ i : SEIRparameters; t i : regime change dates; n start : initial infected output: W : array of nodes with 2D coordinates and states at time t . Q ← ∅ /* priority queue */ I ← random_sample( N , n start ) for n ← I do e ← Infect(node = n , time = 0) Q .enqueue( e ) end while Q not empty do U ← random(0, 1) e ← Q .dequeue() β t , (cid:15) t , γ t ← regime_params( e .time) if e is Expose then e (cid:48) I ← Infect(node = e .node, time = e .time + − log U(cid:15) t ) Q .enqueue( e (cid:48) I ) e .node.state ← E end if e is Infect then e (cid:48) E ← Expose(node = find_contact( e .node), time = e .time + − log Uβ t ) e (cid:48) R ← Remove(node = e .node, time = e .time + − log Uγ t ) if e (cid:48) E .time < e (cid:48) R .time then Q .enqueue( e (cid:48) E ) end Q .enqueue( e (cid:48) R ) e .node.state ← I end if e is Remove then e .node.state ← R end endAlgorithm 1: Event-modulated Gillespie algorithm for the simulation
SIMULATIONS AND RESULTS
Simulation running
The main simulation code is written in C++ (C++17), using
GDAL library for geospatialtransformations and manipulations, and nanoflann k-d tree implementation for radius queries. igure 1.
Simulation results for Canton of Geneva, Switzerland: A) time +39 days; B) time +69 days; C) time+187 days; D) time + 406 days.
The population density map from Global Human Settlement Layer dataset is loaded as a 2Darray, with areas inside and outside border marked separately. The state of the simulation isexported each day of the simulated time as current SEIR counts, and, more importantly, asPNG file representing the current state of population on the density map (where susceptibleindividuals are represented by grey pixels, exposed — by orange pixels, infected — by red pixels,and removed — by green pixels. With this, localized infection clusters in the simulation areclearly visible, as well as directions of infection spreading.In every moment in time, the simulation software is capable of presenting population densitymaps, showing the exact state of each individual in the simulation. The examples of suchsnapshots are given in the Figure 1. It shows the appearance of new infections in localizedclusters, which are quickly moving to the wide geographical distribution of cases.We run the set of simulations, each simulation starts from the same initial condition, in whicha set of randomly chosen nodes is infected and all the other N − nodes are susceptible. Wemeasure the number of recovered nodes at the end of the simulation normalised by N , called thefinal size, averaged over all simulations. For each simulation run, the current counts of nodesin different states is dumped in the output log at some periodic intervals of simulation time(e.g. every day in the simulation). These states are aggregated and displayed simultaneously asstochastic epidemic curves.The proximity matrix A is constructed using information about population density for the eachsimulated region, obtained from the ESM2015 dataset [6]. The age-dependent contact matrix M is obtained from the work of Kiesha Prem et al [7], which extended the results of the POLYMODproject [8] to 152 countries. pr2020 May Jun Jul Aug Sep Oct Nov Dec Jan2021 Feb Mar020406080100 Figure 2.
Simulation: daily hospitalizations in Geneva canton, by age groups
Surrogate model optimization
To provide the initial epidemic parameters for the full simulation, we use surrogate model opti-mization, fitting the (surrogate) model (a stochastic SEIR model assuming homogeneous contactnetwork) to match the observed data (a FOPH dataset for COVID-19 cases in Switzerland,provided by ETH Zurich [9]).We wrote an adapter function for running the model and exporting the simulation results, andcomparing it with actual observed number of cases and the number of hospitalizations for theselected geographical region. Then we run a nonlinear least squares optimization routine curve_-fit from scikit-optimize
Python package, which works with arbitrary model functions. Then,we chart the best fitted results together with boundary lines for assumed testing rates.After the fitting, the determined parameter values are used as inputs to the simulation of themain model, leaving only local infection distance and near/far rate of contacts as remaining freeparameters.In the first stage of the fitting procedure, we fix the (cid:15) and γ values of the SEIR model (inverseincubation period and inverse infectiousness period, as these are not changing), and assume that β (infectiousness rate) is changing with time due to the different regimes. We do not know whenexactly the first infections started (as the first registered cases appeared while the epidemicswas already in progress for some time), but we have limits on when the first lockdown wasintroduced and when it was lifted. We also assume that there was a regime change sometimeafter the lockdown was lifted (people returning to vacations and/or the end of school holidays). pr2020 May Jun Jul Aug Sep Oct Nov Dec Jan2021 Feb Mar050100150200 h o s p i t a li z a t i o n s / d a y Simulated0-55-1010-1515-2020-2525-3030-3535-4040-4545-5050-5555-6060-6565-7070-7575+
Figure 3.
Daily hospitalizations in Geneva canton, by age groups, compared to the simulation results
This leaves the following parameters to be fitted:1. t k — the regime change dates (actual start of the epidemics, lockdown started, lockdownended, new wave started), with bounds. Some of the regime change points (the date ofintroduction of the new pandemic response policies) are known, and can be therefore fixed,but it still occasionally makes sense to consider them as free parameters: if the optimizerfinds a regime change date completely on its own, from data only, it is a part of verificationthat the model and the optimizer work correctly.2. β k — infection spreading rates for these time intervals (for β it is bounded from initialestimations of R ).3. q k — the assumed percentage(s) of people tested, defined for the each regime interval. It isquite difficult both to estimate how much it has changed through time (e.g. in Switzerland,testing criteria changed nearly every week during the peak of the epidemic), and its presentvalue (there is much uncertainty in e.g. estimating the rate of asymptomatic cases in generalpopulation, the probability of somebody having the symptoms going to do the testing, etc.— all these estimates affect the value of q ). In a sense, q acts as a hyperparameter here —some ranges of its values allows for the more efficient fitting than the others.The fitting routine consists of solving the optimisation problem of finding the minimum of thefunction F ( θ ) = N (cid:88) i =1 ρ ( f i ( θ ) ) , where θ = ( θ , ..., θ r ) is a set of parameters which we want to estimate, N is the number ofavailable data points, ρ is a loss function to reduce the influence of outliers, f i ( θ ) is the i -thcomponent of the vector of residuals.ame Description Typical values β k Infection rates (for different regimes) 0.1–2.5 (cid:15)
Incubation rate 0.15–0.3 γ Removal rate 0.1–0.3 t k Epidemic regimes change times, days from the offi-cial start of the epidemic location dependent
Table 2.
Free parameters to be fitted in the surrogate model
Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec Jan2021 Feb Mar0%5%10%15%20%25% % r e c o v e r e d Figure 4.
Simulation: percentage of recovered population in the canton of Geneva
Given a model function m ( t ; θ ) and some data points D = { ( t I , D I | i = 1 , .., N } one normallydefines the vector of residuals as the difference between the model prediction and the data, thatis F i ( θ ) = m ( t i ; θ ) − d i .To compare the simulation outputs against real world observations, we are using the COVID-19 clinical dataset, provided by the Swiss Federal Office of Public Health [9]. The epidemicparameters obtained in the fitting step were submitted to the main simulation code, which ranfor 100 independent iterations. We used the number of hospitalizations as a reference output forcomparison purposes (we consider hospitalizations to be a subclass of R (removed) compartment,as hospitalized patients normally do not infect anybody else). Additionally, we compare thecumulative percentage of recovered patients in the simulation with the numbers obtained fromseroprevalence studies performed in the canton of Geneva [10]. The simulation output as thenumber of the daily hospitalized cases is plotted in the Figure 2. The real hospitalizations datain the canton of Geneva for the same time period is given in the Figure 3, compared to thesimulation values. The simulated percentage of recovered population (for which seroprevalenceis a proxy) is plotted in Figure 4. ESULTS AND DISCUSSION
In this paper we presented a developed framework for stochastic simulation of epidemic processesin populations of any size taking into account population density and age structure in the regionunder consideration.In the most models currently used, the entire population is considered to be homogeneous interms of contact rates (either constant or independent and identically distributed). With thissimplification, everyone has an equal chance of contacting anyone else. The number of infectionsis proportional to the product of the number of infected and the number of susceptible individuals,and depends on the initial rate of transmission of infection (that is, on the properties of the virusitself) and the density of contacts between different groups.In our model, we take into account the contact heterogeneity of the population and differencesin population density in different regions, which significantly changes the prognosis of the devel-opment of the epidemic. We also take into account that at the different stages of the epidemic,the infectiousness rate is changing, due to different quarantine regimes and response policies.To simulate the infection process, we have developed a new variant of the event modulated Gille-spie algorithm. Our implementation supports multiple regime intervals and arbitrary functionsand distributions of epidemic parameters. It can be used to validate the theoretical models andit has the practical application for simulation the epidemic dynamics in larger regions and theentire countries, allowing for different levels of detail.The simulation reproduces faithfully the initial wave of COVID-19 infections which has happenedin March-April 2020. However, even when normalized by the incidence of hospitalizations, theage structure of simulation output does not seem to match the real data observed in the cantonof Geneva; further research is necessary to discriminate between competing hypotheses (e.g.misdetection of cases in the younger population, age-relative differences in travel patterns, caseclusters in retirement homes, etc.)The simulated number of recovered cases can be compared with the result of seroprevalencestudies at a particular point in time. E.g. at May 1st, 2020, the seroprevalence in the cantonwas estimated to be 6.6% (95% confidence interval: 4.3%–9.4%; the simulation gives 2.9% ofrecovered cases at this date. It is lower than the confidence interval estimate, which can beexplained by a number of factors outside the scope of the model (most importantly with thenumber of antibodies diminishing with time, but also geographical clustering, age structureinfluence, etc). However, these results can serve as a limiting point for seroprevalence estimates.Our system allows us not only to track and predict the epidemic progression, but also to simulateand evaluate various measures introduced to contain the spread of infection, such as limitingmobility, partial or complete quarantine, etc. The simulation software also makes it possible tosubsequently model the effectiveness of vaccination and determine the sequence and number ofvaccinated persons required to develop population immunity or to keep the number of infectedat a certain level.
EFERENCES [1] Wouter Van den Broeck et al. “The GLEaMviz computational tool, a publicly availablesoftware to explore realistic epidemic spreading scenarios at the global scale”. In:
BMCinfectious diseases
Reviews ofmodern physics
Applied Mathematics andComputation
265 (2015), pp. 1026–1043.[4] Naoki Masuda and Luis EC Rocha. “A Gillespie algorithm for non-Markovian stochasticprocesses”. In:
SIAM Review
Cham: Springer
598 (2017).[6] Christina Corbane and Filip Sabo.
European Settlement Map from Copernicus Very HighResolution data for reference year 2015 . European Commission, Joint Research Centre(JRC), 2019. doi : .[7] Kiesha Prem, Alex R Cook, and Mark Jit. “Projecting social contact matrices in 152countries using contact surveys and demographic data”. In: PLoS computational biology
PLoS Med
Epidemiological situation in Switzerland and Liechtenstein . . Accessed: 2020-01-18.[10] Silvia Stringhini et al. “Seroprevalence of anti-SARS-CoV-2 IgG antibodies in Geneva,Switzerland (SEROCoV-POP): a population-based study”. In: The Lancet (2020).
PPENDIX A
Age 0-5 5-10 10-15 15-20 20-25 25-30 30-35 35-40 40-45 45-50 50-55 55-60 60-65 65-70 70-75 75+0-5 1.35 0.56 0.25 0.14 0.21 0.39 0.69 0.73 0.49 0.22 0.22 0.18 0.14 0.13 0.07 0.045-10 0.48 5.67 0.89 0.22 0.15 0.33 0.62 0.80 0.88 0.37 0.21 0.16 0.14 0.12 0.05 0.0510-15 0.18 1.74 9.11 0.81 0.29 0.25 0.43 0.68 1.11 0.64 0.34 0.15 0.09 0.10 0.07 0.0715-20 0.09 0.29 3.11 11.68 1.50 0.77 0.64 0.79 1.09 1.20 0.67 0.24 0.08 0.06 0.03 0.0320-25 0.13 0.17 0.31 2.35 3.95 1.77 1.23 1.15 1.01 1.32 0.98 0.49 0.12 0.06 0.06 0.0625-30 0.35 0.24 0.24 0.89 2.07 3.66 1.91 1.52 1.36 1.17 1.21 0.68 0.21 0.08 0.04 0.0430-35 0.59 0.80 0.64 0.52 1.06 1.82 3.13 1.98 1.58 1.26 0.96 0.67 0.29 0.14 0.06 0.0635-40 0.61 0.95 0.77 0.74 0.77 1.45 1.81 3.14 2.22 1.46 1.06 0.53 0.29 0.21 0.11 0.0540-45 0.35 0.80 1.03 1.20 0.97 1.34 1.73 1.99 3.19 1.89 1.34 0.47 0.22 0.16 0.11 0.0645-50 0.19 0.58 0.79 1.86 1.00 1.12 1.39 1.59 1.79 2.57 1.37 0.60 0.20 0.13 0.11 0.1250-55 0.17 0.64 1.11 1.53 1.15 1.48 1.38 1.38 1.91 2.15 2.37 1.00 0.30 0.16 0.10 0.1255-60 0.30 0.66 0.79 0.94 0.75 1.27 1.35 1.07 1.29 1.13 1.33 1.68 0.52 0.27 0.12 0.1160-65 0.31 0.33 0.26 0.44 0.38 0.61 0.82 0.79 0.63 0.56 0.51 0.71 1.23 0.49 0.26 0.1265-70 0.23 0.35 0.28 0.17 0.27 0.42 0.73 0.68 0.65 0.43 0.44 0.58 0.63 1.29 0.33 0.1770-75 0.10 0.27 0.31 0.31 0.17 0.29 0.33 0.53 0.66 0.52 0.41 0.35 0.67 0.72 1.00 0.3275+ 0.20 0.27 0.40 0.33 0.15 0.18 0.33 0.38 0.48 0.57 0.56 0.35 0.27 0.40 0.33 0.56