Ensemble Forecasting of the Zika Space-TimeSpread with Topological Data Analysis
EEnvironmetrics 00, 1–22
DOI: 10.1002/env.XXXX
Ensemble Forecasting of the Zika Space-TimeSpread with Topological Data Analysis
Marwah Soliman , Vyacheslav Lyubchich and Yulia R. Gel ∗ Summary:
As per the records of the World Health Organization, the first formally reported incidence ofZika virus occurred in Brazil in May 2015. The disease then rapidly spread to other countries in Americas andEast Asia, affecting more than 1,000,000 people. Zika virus is primarily transmitted through bites of infectedmosquitoes of the species Aedes (
Aedes aegypti and
Aedes albopictus ). The abundance of mosquitoes and,as a result, the prevalence of Zika virus infections are common in areas which have high precipitation, hightemperature, and high population density. Nonlinear spatio-temporal dependency of such data and lack ofhistorical public health records make prediction of the virus spread particularly challenging. In this paper weenhance Zika forecasting by introducing the concepts of topological data analysis and, specifically, persistenthomology of atmospheric variables, into the virus spread modeling. The key rationale is that topologicalsummaries allow for capturing higher-order dependencies among atmospheric variables that otherwise mightbe unassessable via conventional spatio-temporal modelling approaches based on geographical proximityassessed via Euclidean distance. We introduce a new concept of cumulative Betti numbers and then integratethe cumulative Betti numbers as topological descriptors into three predictive machine learning models:random forest, generalized boosted regression, and deep neural network. Furthermore, to better quantifyfor various sources of uncertainties, we combine the resulting individual model forecasts into an ensembleof the Zika spread predictions using Bayesian model averaging. The proposed methodology is illustrated inapplication to forecasting of the Zika space-time spread in Brazil in the year 2018.
Keywords:
Zika virus, epidemics, Bayesian model averaging, machine learning, neural network Department of Mathematical Sciences, University of Texas at Dallas, Richardson, USA Chesapeake Biological Laboratory, University of Maryland Center for Environmental Science, Solomons, MD, USA ∗ Correspondence to: Yulia R. Gel , Department of Mathematical Sciences, University of Texas at Dallas, 800 West Campbell Road,Richardson, TX 75080, USA. E-mail: [email protected]
This paper has been submitted for consideration for publication in
Environmetrics a r X i v : . [ q - b i o . P E ] S e p nvironmetrics Marwah Soliman, Vyacheslav Lyubchich and Yulia R. Gel
1. INTRODUCTION
The Zika virus is a flavivirus belonging to the
Flaviviridae family which also includes yellowfever, dengue, Japanese encephalitis, and West Nile viruses (Goeijenbier et al. , 2016). Theearliest known occurrence of Zika was identification of the virus in the serum of a rhesusmonkey in 1947 in Uganda (Dick et al. , 1952; Dick, 1952). In 2015 Zika virus was detectedin Brazil, with an estimate of 1.3 million infection cases, and rapidly was transmitted toother countries in North and South Americas, as well as East Asia (Heukelbach et al. , 2016;Malone et al. , 2016). Symptoms of Zika virus infection among humans include headache,fever, reddened eyes, rashes, muscle and joint pain. Furthermore, Zika virus can cause severebirth defects when transmitted from a pregnant woman to her fetus. Given a rapid spread ofthe infection, the World Health Organization declared Zika an epidemic disease from 2015to 2016 (WHO, 2019).Zika is primarily transmitted through the bite of infected
Aedes aegypti and
Aedesalbopictus (CDC, 2018). As a result, the spreading of Zika virus is significantly acceleratedby weather conditions favoring the abundance of the Aedes mosquitoes, for example, intemperate climate zones with high humidity and rainfall (Tjaden et al. , 2013; Rees et al. ,2018; Mu˜noz et al. , 2017). In addition, the virus can be transmitted from mother to herfetus, also via sexual contacts, blood transfusion, and organ transplantation (WHO, 2019).Modeling and forecasting Zika spread continues to attract an ever increasing attention,and there exist three general methodological directions rooted in mathematics, statistics,and machine learning (Ferraris et al. , 2019). Mathematical approaches include vector bornecompartmental and mechanistic transmission models (see, e.g., overviews by Manore andHyman, 2016; Caminade et al. , 2017; Suparit et al. , 2018, and references therein). In turn,statistical methods tend to be largely based on the Box–Jenkins family of models withvarious exogenous predictors, such as atmospheric variables and data from HealthMap digitalsurveillance system, Google trends, and Twitter microblogs (see, e.g., McGough et al. , 2017;2nsemble Forecasting of the Zika Space-Time Spread with Topological Data Analysis
Environmetrics
Teng et al. , 2017; Castro et al. , 2018, and references therein). Finally, machine learningapproaches to modeling spread of Zika virus include random forest (RF), boosted regression(BR), and deep feed-forward neural network (DFFN) models (Soliman et al. , 2019; Seo et al. , 2018). As recently shown by Soliman et al. (2019), the DFFN models tend todeliver more competitive predictive performance than RF and BR. Remarkably, while spatio-temporal epidemiology of (re)emerging infectious diseases has been extensively studied in theenvironmetrics community before (see, for instance, Nobre et al. , 2005; Loh, 2011; Self et al. ,2018, and references therein) and despite the increasing popularity of DL tools (McDermottand Wikle, 2019) in the spatio-temporal environmetrics applications, utility of DL approachesin infectious epidemiology remains largely unexplored.Furthermore, in this paper we bring the tools of topological data analysis (TDA) to spatio-temporal modeling (Diggle et al. , 2005; Torabi, 2013; Nobre et al. , 2005; Jang et al. , 2007;Ugarte et al. , 2009), and prediction of the Zika spread. TDA is an emerging methodologyat the interface of computational topology, statistics, and machine learning that aims toenhance our understanding on the role of the underlying data shape in dynamics of thedata generating process—in our case, the spatio-temporal process of the Zika spread whichexhibits a complex nonstationary dependence structure. Since areas with higher quantitiesof rainfall are principally associated with greater abundance of mosquitoes transmittingZika (Mu˜noz et al. , 2017), our key idea is to employ TDA to explore the structure and toextract information about the shape of the temperature and precipitation data that may beuseful for predicting the spread of Zika infection.While TDA is found to exhibit high utility in many fields, from genetics to finance topower systems (Gidea and Katz, 2018; Saggar et al. , 2018; Li et al. , 2020), application ofTDA within epidemiology (Costa and ˇSkraba, 2014; Lo and Park, 2018) and even moregenerally environmetrics (Islambekov and Gel, 2019; Islambekov et al. , 2019) still remainsvery limited. In particular, TDA and, specifically, persistent homology have been employed3 nvironmetrics
Marwah Soliman, Vyacheslav Lyubchich and Yulia R. Gelby Costa and ˇSkraba (2014) for analysis of influenza-like illness during the 2008–2013 fluseasons in Portugal and Italy. Most recently, Lo and Park (2018) show explanatory power ofTDA for analysis of Zika spread. Particularly, Lo and Park (2018) use persistent homologyfactors of the
Aedes aegypti mosquito occurrence locations as regressors within a linear modeland show a high utility of topological features within a spatial cross-validation framework.The findings of Lo and Park (2018) demonstrate that the model with topological featuresas regressors yields higher coefficient of determination ( R ) and lower cross-validation meansquared error than the benchmark linear model without TDA features.Our method advances the approach of Lo and Park (2018) in multiple ways. In contrastto Lo and Park (2018), who do not consider prediction of Zika incidences over time, our goalis to assess utility of TDA in forecasting future spread of Zika, which is the key towards theoutbreak emergency preparedness and response. Since mosquito occurrence locations mayvary over time, any Zika prediction model based on topological features of mosquito locationsalso requires forecasts of the mosquito occurrences. Such forecasts are not readily availableand may be highly non-trivial, especially in areas with heterogeneous landscapes as Brazil.In contrast to Lo and Park (2018), we consider TDA in application to shape analysisof temperature and precipitation rather than mosquito data. Such approach allows us tosystematically incorporate the future weather and climate forecasts from national weatherservices and derived topological features of such forecasts into Zika prediction models.Furthermore, we evaluate predictive utility of TDA based on multiple statistical and machinelearning models, such as random forest, generalized boosted regression, and DFFN. Weintroduce a new topological summary concept cumulative Betti number , which is used asa predictor of future Zika dynamics and found to deliver a more stable performance thanconventional topological characteristics.Finally, to quantify multiple sources of uncertainty, we develop an adaptive ensemble ofZika forecasting models using Bayesian model averaging (BMA). We illustrate our proposed4nsemble Forecasting of the Zika Space-Time Spread with Topological Data Analysis Environmetrics methodology in application to predicting Zika virus spread for all the 26 states of Brazilduring the year 2018.The key contributions of our paper can be summarized as follows: • To the best of our knowledge, this is the first paper introducing topological data featuresas predictors of future space-time spread of infectious diseases. • To increase stability of the topological summaries and associated derived forecasts,especially under the scenarios with limited sample sizes as in the case of manyemerging climate-sensitive infectious diseases, we introduce the notion of cumulativeBetti numbers . The proposed cumulative Betti numbers can be used in many otherapplications, beyond infectious epidemiology, which involve noisy data of moderatesample sizes, e.g., household travel networks and microgrids. • We validate utility of our proposed predictive approach across various nonparametricmachine learning models, including deep neural networks, which allows for moresystematic and objective assessment of predictive gains (if any) delivered by theproposed topological descriptors.The remainder of the paper is organized in four major sections: data description,methodology for epidemiological forecasting and validation metrics, results, and discussion.In Section 2, we provide information on the collected Zika rate, population density, andatmospheric data. We introduce our topological data analysis in Section 3.1 and statisticaland machine learning forecasting methodology in Section 3.2. Section 3.3 lists the validationmetrics, Section 4 is devoted to validation of the proposed modeling approaches to predictionof Zika rate in Brazil. Finally, the paper is concluded with a discussion in Section 5.5 nvironmetrics
Marwah Soliman, Vyacheslav Lyubchich and Yulia R. Gel
2. DATA DESCRIPTION
Brazil’s Ministry of Health publishes cumulative Zika rate on a weekly basis. However, aswith many other vector-borne diseases, the official surveillance records for zika are oftenincomplete and noisy. For instance, for the year of 2018 the percentage of missing weeksis 28.8%, or 15 weeks out of 52 weeks. In turn, 24 weeks out of 52 weeks are missing inthe year of 2017, which is 46.15%. Hence, we analyze monthly records of Zika virus rate per100,000 individuals, constructed from the cumulative weekly Zika rates published by Brazil’sMinistry of Health (MHB, 2018), in each of the 26 states of Brazil and each month during2017 and 2018.Considering the strong association between Zika virus transmission and local environmentalconditions (e.g., see Tjaden et al. , 2013; Mu˜noz et al. , 2017; Rees et al. , 2018), we alsocollected data for precipitation and air temperature. The weather station data are accessedfor all states for years 2017 and 2018 through World-Weather-Online (2018). The forecastedprecipitation data for 2018 are obtained from INMET (2019). To ensure consistency intemporal resolutions of zika and atmospheric data, all daily atmospheric variables areaggregated to a monthly scale.
3. METHODOLOGY FOR EPIDEMIOLOGICAL FORECASTING3.1. Topological data analysis
Topological data analysis (TDA) is a rapidly emerging methodology at the interface ofcomputational topology, statistics, and machine learning which allows for a systematicmulti-lens assessment of the underlying topology and geometry of the data generatingprocess (Zomorodian and Carlsson, 2005; Carlsson, 2009; Chazal and Michel, 2017). In thispaper, we primarily employ tools of persistent homology (PH) within the TDA framework.The ultimate idea of PH is to quantify dynamics of topological properties exhibited by6nsemble Forecasting of the Zika Space-Time Spread with Topological Data Analysis
Environmetrics the data that live in Euclidean or abstract metric space, at various resolution scales. ThePH approach is implemented in the two key steps: first, the underlying hidden topology ofthe observed data set is approximated using certain combinatorial objects, e.g., simplicialcomplexes, and then the evolution of these combinatorial objects is studied as the resolutionscale varies.We start with providing a brief overview of the main relevant technical concepts.
Definition 1 (Abstract simplicial complex)
Let P be a discrete set. Then, an abstractsimplicial complex is a collection K of finite nonempty subsets of P such that if σ ∈ K and τ ⊂ σ , then τ ∈ K . If | σ | = k + 1 , then σ is called a k -simplex. In case of an Euclidean space, k -simplex corresponds to a convex hull of k + 1 vertices.Hence, a 0-simplex is a vertex, a 1-simplex is an edge, a 2-simplex is a triangle, and a3-simplex is a tetrahedron.One of the most widely used choices for a simplicial complex within the TDA framework isa Vietoris–Rips complex which has gained its popularity due to its computational propertiesand tractability. Definition 2 (Vietoris–Rips complex)
Let ( X, d ) be a metric space, e.g., R m and P ⊆ X be a set of distinct points of X . Let (cid:15) > be a scale. Then the Vietoris–Rips complex V (cid:15) ( P ) is an abstract simplicial complex whose finite simplices σ in P have a diameter at most (cid:15) ,i.e., V (cid:15) ( P ) = { σ ⊆ P | d ( u, v ) (cid:54) (cid:15), ∀ u (cid:54) = v ∈ σ } . That is, we form the proximity graph of P by joining two points in P whenever their pairwisedistance is less than (cid:15) . We now consider a sequence of scales (cid:15) < (cid:15) < . . . < (cid:15) n and associatednested sequence of VR complexes called a Vietoris–Rips filtration V (cid:15) ( P ) ⊆ V (cid:15) ( P ) ⊆ . . . ⊆ V (cid:15) n ( P ). As a result, we can study evolution of topological summaries, such as number ofconnected components, loops, etc., that appear and disappear with an increase of scale7 nvironmetrics Marwah Soliman, Vyacheslav Lyubchich and Yulia R. Gel (cid:15) . Persistent topological features, i.e., those with a longer lifespan over varying resolution (cid:15) < (cid:15) < . . . < (cid:15) n tend to be associated with the underling structural organization of thedata generating process, while features with a shorter lifespan are likely to be a topologicalnoise. [Figure 1 about here.]Figure 1 is an illustration of the Vietoris–Rips filtration process based on a toy exampleof six points. The filtration starts with a ball of radius 0 ( (cid:15) = 0) around each point (seeFigure 1a). As the scale (cid:15) increases to 0.2, 0.5, and 1 (see Figures 1b, 1c, and 1d, respectively),the number of connected components decreases and new topological features, such as theloop, appear.There exist multiple topological summaries to quantify evolution of topological featuresover the increasing scale (cid:15) , e.g., barcode, persistent diagrams, persistent landscapes, andBetti numbers (see the discussion in Chazal and Michel, 2017, and references therein). Inthis project we focus on utility analysis of Betti numbers. Definition 3 (Betti number)
The Betti- k number β k is the rank of the k -th homologygroup. That is, β k ( (cid:15) ) is the number of k -dimensional simplicial complex features for a givenscale (cid:15) . For a given scale ε and a given abstract simplicial complex, e.g., the Vietoris–Rips complex,constructed from the observed point cloud under scale ε , Betti numbers are simply countsof particular topological features in this abstract simplicial complex. For instance, β isthe count of connected components in the Vietoris–Rips complex; while β and β are thenumbers of holes and voids in the Vietoris–Rips complex, respectively.Furthermore, we introduce the concept of cumulative Betti numbers which, as we find, tendto deliver more stable predictive performance for forecasting the Zika spread.8nsemble Forecasting of the Zika Space-Time Spread with Topological Data Analysis Environmetrics
Definition 4 (Cumulative Betti numbers)
Over a sequence of scales (cid:15) < . . . < (cid:15) n ,cumulative Betti number ˜ β k ( (cid:15) m ) , m (cid:54) n , is defined as the sum of Betti numbers β k ( (cid:15) i ) , i = 1 , . . . , m . That is, ˜ β k ( (cid:15) m ) = m (cid:88) i =1 β k ( (cid:15) i ) , m (cid:54) n. (1)In this paper, due to the limited data records, we use only the ˜ β numbers as topologicaldescriptors, that is, the cumulative number of connected components. We hypothesize thathigher predictive performance delivered by cumulative Betti numbers may be explained byhigher robustness of ˜ β to scale sequence selection under uncertainty due to low sample sizes.We used the function ripsDiag in R package TDA (Fasy et al. , 2018) to find the cumulativeBetti 0 numbers.Figure 2 depicts cumulative Betti-0, ˜ β , numbers based on precipitation amounts andtemperature in the states of Acre and Alagoas, Brazil.[Figure 2 about here.]To extract the topological characteristics for precipitation ( X i,j ) and temperature ( X i,j ),we use the following algorithm.Let M j be the number of weather stations in each state j and X i,j,k be thecorresponding weather observations in month i ( i = 1 , . . . , j = 1 , . . . ,
26, and k =1 , . . . , M j ). (Here we suppress the superscripts for the sake of notations.) Consider a pointcloud { X i,j, , X i,j, , . . . , X i,j,M j } for each month i and state j . We now measure (dis)similarity(in terms of the Euclidean distance) among recorded atmospheric variables in the i -thmonth across all weather stations k = 1 , . . . , M j in the j -th state. For instance, in a case oftemperature ( X i,j ), we set up (cid:15) (in degrees Celsius) and obtain a distance graph by connectingtwo weather stations l and m only if their recorded temperature observation differ at most (cid:15) indegrees Celsius, i.e., | X i,j,l − X i,j,m | ≤ (cid:15) . We now count a number of topological features (e.g.,9 nvironmetrics Marwah Soliman, Vyacheslav Lyubchich and Yulia R. Gela number of connected components, or Betti 0 β ( (cid:15) )) for a given scale (cid:15) . Then we increase scale (cid:15) and repeat the procedure. Summing over consider scales (see (1)) yields a cumulative Bettinumber for the i -th month in the j -th state ( i = 1 , . . . , j = 1 , . . . , We employ three statistical and machine learning models to predict the Zika activity at thestate level, namely, random forest (RF), boosted regression (BR), and deep feed-forwardneural networks (DFFN). Using Bayesian model averaging (BMA), we then develop a multi-model ensemble of the Zika forecasts. We train all the models on the year 2017 data (trainingset), then evaluate their out-of-sample forecasts using the 2018 data (testing set). Belowwe provide an overview of the considered modeling approaches and the forecast validationmetrics.Let Y i,j be the Zika rate in month i in state j ( i = 1 , . . . ,
12 and j = 1 , . . . ,
26) and X = { X i,j , . . . , X i,j } be the corresponding regressors. In each of the models, X i,j is theprecipitation, X i,j is the temperature, X i,j is the cumulative Betti-0 for the precipitation, X i,j is the cumulative Betti-0 for the temperature, X i,j is the categorical variable representingmonth (January, . . . , December), X i,j = X i − ,j is the precipitation lagged by one month, and X i,j = X i − ,j is the temperature lagged by one month. We added the lagged variables ( X i,j and X i,j ) to account for delay effects. Random forest (RF)
The random forest is one of the most popular models in machinelearning and statistics, with an idea to combine several individual decision tree models into10nsemble Forecasting of the Zika Space-Time Spread with Topological Data Analysis
Environmetrics an additive multi-model ensemble (Breiman, 2001): g ( x ) = (cid:88) k ∈ Z + f k ( x ) , where f k ( x ) is an individual decision or regression tree. The two fundamental ideas behindthe RF approach are i) for constructing individual trees f k , observations from the trainingset are randomly sampled with replacement (i.e., bootstrapped), and ii) each individual splitin a tree is based on a random subset of variables that are used in the model. These conceptsof RF allow to reduce the effect of overfitting and to optimize a bias-variance trade-off. Weused the function randomForest in R package randomForest (Breiman et al. , 2018) to trainthe random forest model. Generalized boosted regression model
The model is built on two techniques: decisiontree algorithms and boosting methods (Breiman, 1997; Friedman, 2001, 2002). Generalizedboosted models iteratively fit many decision trees to improve the accuracy of the model.Let J be the number of terminal nodes in a regression tree. The tree partitions the inputspace X into J disjoint regions R , . . . , R J . Each regression tree model itself takes an additiveform h ( x ; { b j , R j } J ) = J (cid:88) j =1 b j I ( x ∈ R j ) , where b j is the value predicted in the region R j , x is the input datum, and I ( · ) is the indicatorfunction. Then, the update at the m -th iteration, m ∈ Z + , is F m ( x ) = F m − ( x ) + ρ m J (cid:88) j =1 b jm I ( x ∈ R jm ) , where ρ m is a scaling factor and the solution to the “line search”; { b jm } Jj =1 are thecorresponding least-squares coefficients, and { R jm } Jj =1 are the regions defined by the terminalnodes of the tree at the m -th iteration. 11 nvironmetrics Marwah Soliman, Vyacheslav Lyubchich and Yulia R. GelLet γ jm = b jm ρ m . Note that a separate choice for optimal value γ jm is proposed by Friedman(2001) in each tree region, opposed to a single γ m in an entire tree. Update rule for modelnow becomes: F m ( x ) = F m − ( x ) + J (cid:88) j =1 γ jm I ( x ∈ R jm ) , (2) γ jm = arg min γ (cid:88) x i ∈ R jm L ( y i , F m − ( x i ) + γ ) , (3)where (3) is just the optimal constant update in each terminal node region, based on theloss function L , given the current approximation F m − ( x i ).Additional boosting algorithm details are available in Ridgeway (2019) and Friedman(2001). For the boosted regression, we used the function gbm in R package gbm (Greenwell et al. , 2019) with the number of trees equal to 100 and the interaction depth of 3, chosenthrough a cross-validation. Deep feed-forward neural network (DFFN)
The DFFN algorithm (Hagan andMenhaj, 1994; Riedmiller and Braun, 1993; Zhang et al. , 2007) is performed as follows:input data move forward from input to hidden layers to output, and at each of those movesthe data are non-linearly transformed (using so-called activation function ) and re-weighted.When reaching the output layer, the error is calculated, based on a selected cost function,which reflects how far the DFFN output is with respect to the actual data (i.e., Zika rate)in the training set.To illustrate the DFFN algorithm, let w ljk be the weight connecting the k -th neuron in the( l − j -th neuron in the l -th layer, and let σ ( · ) be the activation function (weuse sigmoid function), where j, k, l ∈ Z + . The activation values at each layer are calculated12nsemble Forecasting of the Zika Space-Time Spread with Topological Data Analysis Environmetrics in the forward procedure as a lj = σ (cid:32)(cid:88) k w ljk a l − k + b lj (cid:33) , where b lj is the bias coefficient, analogous to the intercept term in regression models. Aftercalculating the total error at the output layer, a partial derivative of the cost function withrespect to each weight and bias term is obtained. We train DFFN using h2o.deeplearning function in R package h2o (LeDell et al. , 2019). The optimal structure for DFFN is selectedthrough a cross-validation: with 2 hidden layers, 12 nodes in each layer, and learning rate 0.01. Bayesian model averaging (BMA)
We evaluate uncertainty of epidemiological forecastsby constructing a weighted multi-model ensemble. This application of BMA enables acombination of multiple models, with their respective weights assigned according to themodels’ most recent predictive performance (Hoeting et al. , 1999; Raftery et al. , 2005;Fragoso et al. , 2018).In this project, we define model weights within a BMA using root mean square error(RMSE) calculated for the training set of data:
RM SE ( s ) = (cid:115) (cid:80) i =1 (cid:80) j =1 ( y ij − ˜ y ij ( s )) n , where y ij is the observed data point (in our case, the observed Zika rate in i -th month of2017 at the j -th state), ˜ y ij ( s ) is the corresponding estimate delivered by the s -th model( s = 1 , . . . , S ), n is the total size of the training data set, and S is the number of models.Then, the resulting weight ω s for the s -th model in the BMA ensemble of forecasts iscomputed as follows: ω s = 1 /RM SE ( s ) (cid:80) i =1 /RM SE ( i ) . Now, let
P red ( s ) ( s = 1 , . . . , S ) denote an out-of-sample forecast produced by the s -th13 nvironmetrics Marwah Soliman, Vyacheslav Lyubchich and Yulia R. Gelmodel (in our case, the Zika rate forecast of the s -th model for the j -th state and i -th monthin 2018). Then the multi-model BMA forecast for that state is defined as P red
RMSE = S (cid:88) s =1 ω s P red ( s ) . We set s = 1 to represent the random forest model, s = 2 to represent generalized boostedregression, and s = 3 to correspond to the deep feed-forward neural network. We employ two standard statistical measures of accuracy, that is, root mean square error(RMSE) and mean absolute error (MAE) (Bluman, 2009; Peck et al. , 2015; Kim and Ahn,2019). Let y i,j and ˆ y i,j be the observed and predicted Zika rates for the i -th month in j -thstate in the testing data set. Then, RM SE = (cid:115) (cid:80) i =1 (cid:80) j =1 (ˆ y i,j − y i,j ) N ,M AE = 1 N (cid:88) i =1 26 (cid:88) j =1 | ˆ y i,j − y i,j | , where N is the size of the testing data set. In addition, correlation between observed andforecasted values is measured through the Pearson correlation coefficient r ( y, ˆ y ) = (cid:80) i =1 (cid:80) j =1 (ˆ y i,j − ¯ˆ y )( y i,j − ¯ y ) (cid:113)(cid:80) i =1 (cid:80) j =1 (ˆ y i,j − ¯ˆ y ) (cid:113)(cid:80) i =1 (cid:80) j =1 ( y i,j − ¯ y ) , where ¯ y denotes average Zika rate across all states of Brazil in year 2018, and ¯ˆ y is the averageof out-of-sample predictions across all the states for the corresponding period.Lower RMSE, MAE, and higher Pearson correlation coefficient imply better forecastingperformance. 14nsemble Forecasting of the Zika Space-Time Spread with Topological Data Analysis Environmetrics
4. RESULTS
We train our models based on the data observed in the year of 2017 and use the 2018 datafor verification. In particular, let Y i,j be a vector of Zika rates in the i -th month and the j -thstate of year 2017, and let X be a matrix of regressors (i.e., design matrix) in the year of 2017.We now compare the performance of the models with persistent features (i.e., with the fullset of inputs X i,j , . . . , X i,j ) against the models without persistent features (i.e., includingonly X i,j , X i,j , and X i,j as the inputs). As Table 1 demonstrates, models with persistentfeatures tend to perform better than models without persistent features.To address whether predictive power of models with topological features is significantlydifferent from models without topological features, we perform Welch’s t -test (Welch, 1947;Starnes et al. , 2010) on the absolute and squared errors delivered by models with andwithout topological predictors, under the null hypothesis that topological predictors yieldno predictive gain. In addition, we assess whether the deep learning model (i.e., a deep feed-forward network) yields a significantly different predictive gain comparing to random forestand boosted regression. [Table 1 about here.][Table 2 about here.]Table 1 indicates that topological features lead to a highly statistically significant predictivegain, while incorporated into the DL DFFN model. However, while topological features resultin lower predictive RMSEs for random forest and boosted regression, the predictive gainsevaluated using Welch’s t -test, appear to be not significant. In turn, Table 2 suggests thatDFFN with persistent features delivers the most competitive predictive performance amongthe three individual models, with Welch’s t -test p -values of < . nvironmetrics Marwah Soliman, Vyacheslav Lyubchich and Yulia R. Gelto deliver highly statistically significant forecasting utility within the BMA framework (seeTable 1).For example, the incorporation of the persistent features lowers the RMSE of RF by 6.5%,BR by 12.0%, and DFFN by 18.4%, while the RMSE of the BMA decreases by 19.8%. Similarimprovements are observed in terms of MAE, with the MAE of BMA decreasing by 45.7%.[Figure 3 about here.]Among the three individual models considered in this study, the tree-based models doesnot deliver the best forecasting performance. Figure 3a shows the RF predictions are groupedabove the 45-degree line, denoting over-prediction of the Zika rates; bias of RF is a well knownissue (see technical details, e.g., in Chapter 15 by Hastie et al. , 2009). The predictions fromother models are less biased overall. Particularly, DFFN (Figure 3c) and BMA (Figure 3d)show a better balance of over-prediction and under-prediction. However, the individual errorsare larger (points in the graphs for BR are farther from the 45-degree line) than for RF, seeFigure 3b. [Figure 4 about here.]In addition, we examine the geographical distribution of prediction errors of each model.The map in Figure 4 suggests that the BMA-RMSE approach delivers the highest predictiveperformance for all considered states. Remarkably, among the best predicted states using theBMA-RMSE approach are the states with the highest infection rates, that is, Mato Grosso,Goias, and Mato Grosso do Sul (see also Figure 5 showing the average Zika rates, and Ragas,2018). [Figure 5 about here.]
Remark
Finally, while Pearson correlations reported in Table 1 are generally low, theresulting correlations appear to be overall on par with the previous studies on zika in other16nsemble Forecasting of the Zika Space-Time Spread with Topological Data Analysis
Environmetrics
Latin and Southern American countries (McGough et al. , 2017). In particular, McGough et al. (2017) report Pearson correlations for 3-week ahead forecast of zika in Honduras notexceeding 0.35 and as low as 0.08. Similar magnitude of Pearson correlations of about 0.30are found for autoregressive models for 3-week ahead forecast in Colombia, while correlationsin El Salvador, Martinique, and Venezuela tend to be somewhat higher. Overall, correlationsreported by McGough et al. (2017) tend to vary substantially among countries and amongpredictive models and to decrease noticeably with the increase of the forecasting horizon.
5. DISCUSSION
Zika continues to be one of the primary healthcare concerns in the Americas, Oceania, Africaand many other parts of the world – making the virus a global challenge for healthcareprofessional worldwide and demanding the world’s collective preparedness for this climate-sensitive (re)emerging disease.In this paper we have brought the concepts of topological data analysis of atmosphericvariables to enhance prediction of Zika virus. In particular, we have integrated the cumulativeBetti numbers as topological descriptors of precipitation and temperature dynamics—one ofthe key environmental factors in Zika spread—into three predictive machine learning modelsfor Zika: random forest, boosted regression, and deep feed-forward neural network. To betteraccount for various sources of uncertainties and harness the power of individual predictivemodels, we have combined the resulting individual forecasts into an ensemble of the Zikaspread predictions using Bayesian model averaging. Our findings based on the analysis ofZika space-time spread in Brazil have indicated that topological summaries of precipitationdynamics and air temperature contain important predictive information for the future Zikaspread.In the future we plan to advance the proposed topological approach to analysis of other17 nvironmetrics
Marwah Soliman, Vyacheslav Lyubchich and Yulia R. Gelrelated climate-sensitive diseases such as chikungunya and dengue, not only in Brazil, but alsoin other countries. Furthermore, epidemiological forecasting can benefit from using the toolsof topological data analysis for understanding the spread of diseases through examination ofthe joint dynamics of topological summaries of the disease rates and associated environmentaland socio-economic factors.Finally, there also exist multiple directions how to better account for various types ofuncertainties associated with biosurveillance of emerging climate-sensitive infectious diseases.To address the issue of limited data records and noisy epidemiological information, we planto integrate various nontraditional data sources, such as webqueries, into surveillance andforecasting of the emerging infectious diseases. Understanding topological summaries of suchnontraditional data sources and matching topological patterns of the conventional datafrom public health units and webqueries can shed a light on future spatial dynamics ofthe emerging infectious disease. We then plan to employ the semi-parametric bootstrap todevelop probabilistic forecasts associated with each model and multiple data sources, whichin return can be combined into a joint biosurveillance ensemble.
REFERENCES
Bluman AG, 2009.
Elementary statistics: A step by step approach . McGraw-Hill Higher Education New York.Breiman L, 1997. Arcing the edge.
Technical Report 486. Statistics Department, University of California,Berkeley .Breiman L, 2001. Random forests.
Machine learning (1): 5–32.Breiman L, Cutler A, Liaw A, Wiener M, 2018. randomForest: Breiman and Cutler’s Random Forests forClassification and Regression . R package version 4.6-14.Caminade C, Turner J, Metelmann S, Hesson JC, Blagrove MSC, Solomon T, Morse AP, Baylis M, 2017.Global risk model for vector-borne transmission of Zika virus reveals the role of El Ni˜no 2015. Proceedingsof the National Academy of Sciences (1): 119–124.Carlsson G, 2009. Topology and data.
Bulletin of the American Mathematical Society (2): 255–308. Environmetrics
Castro MC, Han QC, Carvalho LR, Victora CG, Fran¸ca GV, 2018. Implications of Zika virus and congenitalZika syndrome for the number of live births in Brazil.
Proceedings of the National Academy of Sciences (24): 6177–6182.CDC, 2018. Centers for disease control and prevention (About Zika).Chazal F, Michel B, 2017. An introduction to topological data analysis: Fundamental and practical aspectsfor data scientists. arXiv Preprint arXiv:1710.04019 .Costa JP, ˇSkraba P, 2014. A topological data analysis approach to epidemiology. In
European Conference ofComplexity Science .Dick G, 1952. Zika virus (II). Pathogenicity and physical properties.
Transactions of the Royal Society ofTropical Medicine and Hygiene (5): 521–534.Dick G, Kitchen S, Haddow A, 1952. Zika virus (I). Isolations and serological specificity. Transactions of theRoyal Society of Tropical Medicine and Hygiene (5): 509–520.Diggle P, Rowlingson B, Su Tl, 2005. Point process methodology for on-line spatio-temporal diseasesurveillance. Environmetrics: The official journal of the International Environmetrics Society (5): 423–434.Fasy BT, Kim J, Lecci F, Maria C, Millman DL, Rouvreau V, Morozov D, Bauer U, Kerber M, ReininghausJ, 2018. TDA: Statistical Tools for Topological Data Analysis . R package version 1.6.4.Ferraris P, Yssel H, Miss´e D, 2019. Zika virus infection: An update.
Microbes and infection .Fragoso TM, Bertoli W, Louzada F, 2018. Bayesian model averaging: A systematic review and conceptualclassification.
International Statistical Review (1): 1–28.Friedman JH, 2001. Greedy function approximation: a gradient boosting machine. Annals of Statistics (5):1189–1232.Friedman JH, 2002. Stochastic gradient boosting. Computational Statistics & Data Analysis (4): 367–378.Gidea M, Katz Y, 2018. Topological data analysis of financial time series: Landscapes of crashes. Physica A:Statistical Mechanics and Its Applications : 820–834.Goeijenbier M, Slobbe L, Van der Eijk A, de Mendon¸ca Melo M, Koopmans M, Reusken C, 2016. Zika virusand the current outbreak: an overview.
Neth J Med (3): 104–9.Greenwell B, Boehmke B, Cunningham J, Developers G, 2019. gbm: Generalized Boosted Regression Models .R package version 2.1.5. nvironmetrics Marwah Soliman, Vyacheslav Lyubchich and Yulia R. Gel
Hagan MT, Menhaj MB, 1994. Training feedforward networks with the Marquardt algorithm.
IEEETransactions on Neural Networks (6): 989–993.Hastie TJ, Tibshirani RJ, Friedman JH, 2009. The Elements of Statistical Learning: Data Mining, Inference,and Prediction . Springer, New York, 2 edition.Heukelbach J, Alencar CH, Kelvin AA, de Oliveira WK, de G´oes Cavalcanti LP, 2016. Zika virus outbreakin Brazil.
The Journal of Infection in Developing Countries (02): 116–120.Hoeting JA, Madigan D, Raftery AE, Volinsky CT, 1999. Bayesian model averaging: a tutorial. StatisticalScience (4): 382–401.INMET, 2019. Instituto nacional de meteorologia. .Islambekov U, Gel YR, 2019. Unsupervised space–time clustering using persistent homology. Environmetrics (4): e2539.Islambekov U, Yuvaraj M, Gel YR, 2019. Harnessing the power of topological data analysis to detect changepoints. Environmetrics : e2612.Jang MJ, Lee Y, Lawson AB, Browne WJ, 2007. A comparison of the hierarchical likelihood and bayesianapproaches to spatial epidemiological modelling.
Environmetrics: The official journal of the InternationalEnvironmetrics Society (7): 809–821.Kim J, Ahn I, 2019. Weekly ILI patient ratio change prediction using news articles with support vectormachine. BMC Bioinformatics (1): 259.LeDell E, Gill N, Aiello S, Fu A, Candel A, Click C, Kraljevic T, Nykodym T, Aboyoun P, Kurka M,Malohlava M, 2019. h2o: R Interface for ‘H2O’ . R package version 3.26.0.2.Li B, Ofori-Boateng D, Gel YR, Zhang J, 2020. A hybrid approach for transmission grid resilience assessmentusing reliability metrics and power system local network topology. Sustainable and Resilient Infrastructure : 1–16.Lo D, Park B, 2018. Modeling the spread of the Zika virus using topological data analysis.
PloS One (2):e0192120.Loh JM, 2011. K-scan for anomaly detection in disease surveillance. Environmetrics (2): 179–191.Malone RW, Homan J, Callahan MV, Glasspool-Malone J, Damodaran L, Schneider ADB, Zimler R, TaltonJ, Cobb RR, Ruzic I, et al. , 2016. Zika virus: medical countermeasure development challenges. PLoSNeglected Tropical Diseases (3): e0004530. Environmetrics
Manore C, Hyman M, 2016. Mathematical models for fighting Zika virus.
Siam News .McDermott PL, Wikle CK, 2019. Deep echo state networks with uncertainty quantification for spatio-temporal forecasting.
Environmetrics (3): e2553.McGough SF, Brownstein JS, Hawkins JB, Santillana M, 2017. Forecasting Zika incidence in the 2016 LatinAmerica outbreak combining traditional disease surveillance with search, social media, and news reportdata. PLoS Neglected Tropical Diseases (1): e0005295.MHB, 2018. Ministry of Health of Brazil. http://portalms.saude.gov.br/boletins-epidemiologicos .Mu˜noz ´AG, Thomson MC, Stewart-Ibarra AM, Vecchi GA, Chourio X, N´ajera P, Moran Z, Yang X, 2017.Could the recent Zika epidemic have been predicted? Frontiers in Microbiology : 1291.Nobre AA, Schmidt AM, Lopes HF, 2005. Spatio-temporal models for mapping the incidence of malaria inpar´a. Environmetrics (3): 291–304.Peck R, Olsen C, Devore JL, 2015. Introduction to statistics and data analysis . Cengage Learning.Raftery AE, Gneiting T, Balabdaoui F, Polakowski M, 2005. Using Bayesian model averaging to calibrateforecast ensembles.
Monthly Weather Review (5): 1155–1174.Ragas J, 2018. Revisiting global health from the periphery: the Zika virus.
Hist´oria, Ciˆencias, Sa´ude-Manguinhos (4): 1185–1187.Rees EE, Petukhova T, Mascarenhas M, Pelcat Y, Ogden NH, 2018. Environmental and social determinantsof population vulnerability to Zika virus emergence at the local scale. Parasites & Vectors (1): 290.Ridgeway G, 2019. Generalized boosted models: A guide to the gbm package. Update (1): 2019.Riedmiller M, Braun H, 1993. A direct adaptive method for faster backpropagation learning: The RPROPalgorithm. In Proceedings of the IEEE International Conference on Neural Networks , volume 1993, SanFrancisco, CA, 586–591.Saggar M, Sporns O, Gonzalez-Castillo J, Bandettini PA, Carlsson G, Glover G, Reiss AL, 2018. Towardsa new approach to reveal dynamical organization of the brain using topological data analysis.
NatureCommunications (1): 1399.Self SCW, McMahan CS, Brown DA, Lund RB, Gettings JR, Yabsley MJ, 2018. A large-scale spatio-temporalbinomial regression model for estimating seroprevalence trends. Environmetrics (8): e2538.Seo J, Lee J, Kim K, 2018. Decoding of polar code by using deep feed-forward neural networks. In , IEEE, 238–242. nvironmetrics Marwah Soliman, Vyacheslav Lyubchich and Yulia R. Gel
Soliman M, Lyubchich V, Gel YR, 2019. Complementing the power of deep learning with statistical modelfusion: Probabilistic forecasting of influenza in Dallas County, Texas, USA.
Epidemics : 100345.Starnes DS, Yates D, Moore DS, 2010. The practice of statistics . Macmillan.Suparit P, Wiratsudakul A, Modchang C, 2018. A mathematical model for Zika virus transmission dynamicswith a time-dependent mosquito biting rate.
Theoretical Biology and Medical Modelling (1): 11.Teng Y, Bi D, Xie G, Jin Y, Huang Y, Lin B, An X, Feng D, Tong Y, 2017. Dynamic forecasting of Zikaepidemics using Google Trends. PloS One (1): e0165085.Tjaden NB, Thomas SM, Fischer D, Beierkuhnlein C, 2013. Extrinsic incubation period of dengue:Knowledge, backlog, and applications of temperature dependence. PLoS Neglected Tropical Diseases (6):e2207.Torabi M, 2013. Spatio–temporal modeling for disease mapping using car and b-spline smoothing. Environmetrics (3): 180–188.Ugarte M, Goicoa T, Ibanez B, Militino A, 2009. Evaluating the performance of spatio-temporal bayesianmodels in disease mapping. Environmetrics: The official journal of the International EnvironmetricsSociety (6): 647–665.Welch BL, 1947. The generalization of student’s’ problem when several different population variances areinvolved. Biometrika (1/2): 28–35.WHO, 2019. World Health Organization. .World-Weather-Online, 2018. .Zhang JR, Zhang J, Lok TM, Lyu MR, 2007. A hybrid particle swarm optimization–back-propagationalgorithm for feedforward neural network training. Applied Mathematics and Computation (2): 1026–1037.Zomorodian A, Carlsson G, 2005. Computing persistent homology.
Discrete & Computational Geometry (2): 249–274. Environmetrics
FIGURES (a) (cid:15) = 0 (b) (cid:15) = 0 . (c) (cid:15) = 0 . (d) (cid:15) = 1 Figure 1.
An illustration of Vietoris–Rips filtration with varying scales (cid:15) . nvironmetrics FIGURES
AcreAmapa C u m u l a t i v e B e tt i (a) Precipitation, mm
AcreAmapa C u m u l a t i v e B e tt i (b) Temperature, ◦ C Figure 2.
Cumulative Betti-0, ˜ β , numbers for precipitation and temperature inthe states Acre and Alagoas of Brazil in 2017. Environmetrics P r ed i c t ed (a) Random forest P r ed i c t ed (b) Boosted regression P r ed i c t ed (c) DFFN P r ed i c t ed (d) BMA-RMSE
Figure 3.
Predicted values from each of the four models vs. the actual Zika rate in2018. The black 45-degree lines represent the case of ideal prediction accuracy. nvironmetrics FIGURES
RMSE (a)
Random forest (b)
Boosted regression (c)
DFFN (d)
BMA-RMSE
Figure 4.
Map of predictive root mean square error (RMSE) in Brazil deliveredby each of the four models for Zika rate in 2018.
Environmetrics
Zika rate per 100,000 (a)
Observed (b)
Random forest (c)
Boosted regression (d)
DFFN (e)
BMA-RMSE
Figure 5.
Map of the (a) observed average Zika rate in 2018 and (b–e) averagepredicted Zika rates in 2018 by each of the four models. nvironmetrics TABLES
TABLESTable 1.
Performance summary for out-of-sample forecasts in terms of MSE, MAE andWelch’s t -test p -values constructed for MSEs and MAEs of the models with and withouttopological features With persistent features Without persistent features MSE MAEMethod RMSE MAE r ( y, ˆ y ) RMSE MAE r ( y, ˆ y ) p -value p -valueRandom forest 11.351 8.262 0.391 12.134 8.667 0.307 0.348 0.396Boosted regression 13.238 9.154 0.363 15.044 10.091 0.286 0.093 0.262DFFN 8.934 6.361 0.425 10.953 8.084 0.347 0.001 0.007BMA-RMSE 6.721 4.265 0.419 8.380 7.854 0.397 < . < . Environmetrics
Table 2.
Welch’s t -test p -values for comparison of predictive gains delivered by the deepfeed-forward network (DFFN) vs. random forest and boosted regressionModel With persistent features Without persistent featuresRandom forest 0.002 0.277Boosted regression < ..