A robust and non-parametric model for prediction of dengue incidence
AA robust and non-parametric model forprediction of dengue incidence
Atlanta Chakraborty and Vijay Chandru Department of Mathematics, Indian Institute of Science,[email protected] Center For BioSystems Science And Engineering, IndianInstitute of Science, [email protected]
Abstract
Disease surveillance is essential not only for the prior detection ofoutbreaks but also for monitoring trends of the disease in the long run.In this paper, we aim to build a tactical model for the surveillance ofdengue, in particular. Most existing models for dengue prediction ex-ploit its known relationships between climate and sociodemographic fac-tors with the incidence counts, however they are not flexible enough tocapture the steep and sudden rise and fall of the incidence counts. Thishas been the motivation for the methodology used in our paper. We builda non-parametric, flexible, Gaussian Process (GP) regression model thatrelies on past dengue incidence counts and climate covariates, and showthat the GP model performs accurately, in comparison with the other ex-isting methodologies, thus proving to be a good tactical and robust modelfor health authorities to plan their course of action.
Keywords: epidemic, dengue, non-parametric, Gaussian pro-cess, covariance, kernel, robust, tactical model
Dengue is a fast emerging pandemic-prone viral disease transmitted by
Aedesaegypti and
Aedes albopictus mosquitos. According to the World Health Or-ganisation (WHO), each year, an estimated 390 million dengue infections occurall around the world. Cases across the Americas, South-East Asia and WesternPacific exceeded 1.2 million in 2008 and over 3.2 million in 2015 [21]. Severalprecautionary measures include vector control tools, like controlling mosquitopopulations, however, implementation is a major challenge and effective dengueprevention is rarely achieved, specially in developing countries. Often, it is theemergency vector control operations that is usually applied when an outbreakoccurs, like insecticide fogging. 1 a r X i v : . [ s t a t . A P ] A ug ccurate forecasts of incidence cases, or infected individuals are key to plan-ning and resource allocation of dengue vaccines, medical centres, etc. Previousattempts to model dengue have made use of relatively simple models, such asgeneralised linear model and ARIMA, exploiting the relationship with otherenvironmental variables ([13], [4], [7]). However, most of the times, diseasedynamics are not well understood and such models may fail to capture that[11]. Dengue is closely related to the seasonal changes, rainfall and humidity.Our model is trained on historical incidence data, mean surface temperature,humidity and rainfall, and makes use of a Bayesian non-parametric modellingframework, Gaussian processes (GP), that allows for flexibility in the model,thus being able to forecast the sudden peak increase of the incidence counts. A study and systematic review of existing dengue modelling methods, conductedby Louis et al, [11], has been instrumental in providing us with an overview ofcurrent modelling efforts and their limitations. The study enlists a wide rangeof predictors that were used to create dengue risk maps, like socioeconomicand demographic data, climatic and environmental data, remote sensing andentomological data.Several efforts ([4], [13], [7], [6]) have made use of parametric models likelogistic regression models, multinomial models and generalised linear modelswith climatic covariates as inputs. Climatic data have been found to be par-ticularly useful for the generation of risk maps [2]. Additionally, other factors,such as human mobility or housing conditions, are also likely to be linked tothe occurrence of dengue cases [8]. In contrast to the fields of malaria andother vector borne diseases, the study shows that dengue is particularly chal-lenging due to the high number of non-detectable breeding sites ([12], [5], [18],[9]). There have been models making use of entomological data ([20], [3], [17],[18], [19], [14], [1]) studying the link between several vector aspects (like larvaeabundance, ovi-trapping) and dengue cases, however the exact nature of therelationship remains unknown. Such surveys are not only labor-intensive andcostly, but they also yield spurious results that are not useful for prediction([3]).The weakness of current dengue prediction efforts originates from the factthat dengue is highly dynamic and multifactorial. Factors like host immunity,genetic diversity of circulating viruses, etc also play an important role, butthey are difficult and expensive to track. They continue to pose challengesand limit the ability to produce accurate and effective risk maps and models,thus failing to support the development of an early warning system. Manyepidemiological models have developed and have gained importance in the lastdecade, however, they cannot be used in the public health context due to theircomplexity and the extensive need for input data. Additionally, most modelsproduce an average forecast of the numbers in the long run, instead of beingable to predict the immediate rise and fall of the incidence counts. On account2f this, we aim to build a robust but easily implementable model that woulddepend only on climatic variables and past historical data. The GP model hasan added advantage of generalising the model to include several other factorsthat affects dengue, like human mobility, vector data, etc.
We have chosen Singapore as our case study due to the free availability ofweekly dengue incidence counts. Data for the years 2005-2017 was downloadedfrom [22]. Rainfall, relative humidity and surface air temperature data wasalso downloaded from NEA website. 2005-2016 have been used as a trainingset, whereas the year 2017 was used as a test set. The incidence data are finalcounts, i.e. the total number of cases in each week. In order to avoid overfittingor under-fitting, we have implemented the process of k-fold cross validation(k=10) to help us understand the performance of the model, and select anappropriate one.Figure 1 shows Singapore incidence counts across 2005-2017. It clearly de-picts a yearly circle. Not only does the data have a mean response which varieswith time, but also the variability of the incidence counts is unequal across themonths. The dengue count is a heteroskedastic variable when predicted by themonth number.
This paper proposes to model dengue incidence with Gaussian processes (GP),a non-parametric modelling framework [15], for the purpose of getting an addedflexibility and making accurate predictions of the peak season, as it falls andrises. One can think of GP as defining a distribution over functions, and infer-ence takes place directly in the space of functions.A GP is completely specified by its mean function m ( x ) and the covariancefunction k ( x , x (cid:48) ) of a process f ( x ) , and we write f ( x ) ∼ GP ( m ( x ) , k ( x , x (cid:48) )) . Wefirst assume our model to be of the form y = f ( x ) + (cid:15), where (cid:15) is additive andindependent identically distributed Gaussian noise with variance σ n . From our training data, we know { ( x i , y i ) | i = 1 , ..., n } , where n is the totalnumber of observations. The joint distribution of the training outputs, y , andthe test outputs f ∗ according to the prior is (cid:20) yf ∗ (cid:21) ∼ N (cid:18) , (cid:20) K ( X , X ) + σ ∗ n I K ( X , X ∗ ) K ( X ∗ , X ) K ( X ∗ , X ∗ ) (cid:21)(cid:19) . If there are n ∗ test points then K ( X , X ∗ ) denotes the n × n ∗ matrix of thecovariances evaluated at all pairs of training ( X ) and test points ( X ∗ ), and3imilarly for the other entries. To get the posterior distribution over functionswe need to restrict this joint prior distribution to contain only those functionswhich agree with the observed data points. The key predictive equations forGaussian process regression are f ∗ | X, y, X ∗ ∼ N ( ¯ f ∗ , cov ( f ∗ )) (1)¯ f ∗ = K ( X ∗ , X )[ K ( X, X ) + σ n I ] − y (2) cov ( f ∗ ) = K ( X ∗ , X ∗ ) − K ( X ∗ , X )[ K ( X, X ) + σ n I ] − K ( X, X ∗ ) . (3)By this definition, GPs allow us to obtain the exact predictive distributionthrough a closed-form expression. They are also flexible, since one can useany positive semi-definite kernel as the covariance function K as a measureof similarity between points, providing rich insights about the dependenciesbetween them.Under the Gaussian process model, the prior is Gaussian, f | X ∼ N (0 , K ) , or log p ( f | X ) = − f T K − f −
12 log | K | − n π (4)and the likelihood is a factorised Gaussian y | f ∼ N ( f, σ n I ) . We thus arrive atthe log marginal likelihood aslog p ( y | X ) = − y T ( K + σ n I ) − y −
12 log | K + σ n I | − n π. (5)We estimate the hyper-parameters of K by maximising the marginal likeli-hood (or minimising the negative log likelihood). We can use several gradient-based optimisers, since it is necessary to compute the partial derivatives of themarginal likelihood w.r.t. the hyper-parameters. For our purpose, we use the“BFGS” method.We apply a logarithmic (one plus) transformation on the response variableand model this transformation as a GP. This is done to ensure that the largestvariances are stabilised. The main task in modelling via GPs is to define anappropriate covariance structure. We assume a zero-mean GP by centering theresponse variable about its mean. Covariance functions encode our assumptions about the function which we wishto learn. It is a basic assumption that input points which are “close” to eachother are likely to have similar target values y . Based on this, training pointsthat are close to a test point should provide information about the prediction atthat point. It is the covariance function that defines this nearness or similarity.A complex covariance function is derived by combining several different kindsof simple covariance functions. The covariance structure imposed by the GPprior should reflect what we expect from the data. We make use of standard4ernels defined in the GP literature [15]. Our goal is to model the transformedincidence counts as a function of x i = ( x , x , x , x ) i , i.e. the i th observa-tion and its corresponding month number, total monthly rainfall, mean relativehumidity and mean surface air temperature respectively.To enforce the assumption that the test input is highly correlated with itspre-ceding inputs, we use a 5/2 Matern Kernel which is defined as k ( x i , x j ) = σ (cid:32) √ xl + 5∆ x l (cid:33) exp (cid:32) −√ xl (cid:33) (6)where ∆ x = | x i − x j | is the absolute distance between the inputs. Its hyper-parameters, σ and l are used to control the strength of correlation signal andthe span of time that should correlate, respectively.We use a second component to exploit the periodicity observed in dengueincidence, while still giving more importance to closer periods of time. k ( x i , x j ) = k ( x i , x j ) × k ( x i , x j ) (7) k ( x i , x j ) = σ exp (cid:18) − ∆ x l (cid:19) (8) k ( x i , x j ) = exp (cid:18) − sin (cid:18) π ∆ xp (cid:19) /l per (cid:19) (9) k is a squared-exponential kernel (also called radial basis function kernel) and k is a periodic kernel. The hyper-parameters of k - l and σ are used tocontrol the number of months that should impact the incidence and strength ofthe correlation signal respectively. p and l per , the hyper-parameters of k areused to control the periodicity and length-scale of the signal respectively.Next, we model the small irregularities with a rational quadratic term. Therational quadratic kernel allows us to model the data varying at multiple scales. k ( x i , x j ) = σ (cid:18) x αl (cid:19) − α (10) σ is the magnitude, α > l is the characteristiclength-scale.Finally, we specify the noise model as the sum of a squared exponential con-tribution and an independent component. Noise in the series could be due tomeasurement inaccuracies. It could also be due to the changes in weather phe-nomena every year, hence we assume that there is a little amount of correlationin time. k ( x i , x j ) = σ f exp (cid:18) − ∆ x l (cid:19) + σ n δ x i x j (11)where σ f is the signal variance, l is its length scale and σ n is the magnitude ofthe independent noise component. 5he final covariance function is k ( x i , x j ) = k ( x i , x j ) + k ( x i , x j ) + k ( x i , x j ) + k ( x i , x j ) (12)with 12 hyper-parameters.Note that most of the above defined covariance functions are stationary, i.e.invariant to translations in the input space. Sampson and Guttorp, [16], intro-duced the method of warping in 1992, which allows us to introduce an arbitrarynon linear map u ( x ) of the input space x , and then use stationary covariancefunctions in the u ( x ) space. This is yet another reason for the logarithmictransformation of the incidence counts.The code has been run on R, mainly using the GauPro package. All the abovementioned kernels are imported from the kergp package. The training data isthen fit and the marginal likelihood is optimised using the “BFGS” algorithm.
We compare our GP model with 3 different existing methodologies- time seriesforecasting (Arima), generalised additive models (GAM) and predictions fromrandom forests (RF), on the basis of two different metrics- root mean squarederror (RMSE) and mean absolute deviation (MAD). The performance acrossvarious methods is reported as follows, for both in sample and out sample fore-casts. Model Training RSME Test RMSE Training MAD Test MADGAM 0.608 0.682 0.623 0.883Time series 0.521 0.562 0.493 0.512Random Forest 0.676 0.719 0.713 1.101GP 8.3e-07 0.260 9.63e-07 0.262Overfitting is taken care of by cross validation. As can be seen from the aboveperformance metrics, GP is very accurate for forecasting and easily imple-mentable. It also has room for adding more covariates to the model. For theout-of-sample forecasts, Figure 2 shows the predictions for the year 2017.
As we have seen in the last section, the model fit by Gaussian process servesas a good tactical model. This is due to its non-parametric nature and itsflexibility, thus being able to automatically adapt to different scenarios. Theother advantage of this model is the nature of the input, incidence numberscorrelated with climate variables. In the future, we would like to investigatethe role of human and vector factors in helping us forecast dengue incidencein a public health context. The GP model can accommodate such factors by6ntroducing kernel functions based on human and vector interactions and addit to the already defined kernel function in this paper.The GP model provides a sufficient window for health authorities to beaware of the incoming dengue counts and hence carefully plan and take necessaryactions. Its easy implementation can act as a very accurate early warning systemwhen implemented on a weekly basis.To make the model functional on a weeklybasis, one may consider data on a weekly scale spatially and consider resourceallocation and facility problems to effectively implement an operational model.
References [1] Adams B, Kapan DD: Man Bites Mosquito: Understanding the Contribu-tion of Human Movement to Vector-Borne Disease Dynamics.
PLoS One vol4:e6763 (2003).[2] Albinati, J., Wagner, M., Pappa, G.L.: An accurate Gaussian process-based early warning system for dengue fever, arXiv: 1608.03343v1 , [stat.AP](2016).[3] Banu S, Hu W, Hurst C, Tong S.: Dengue transmission in the Asia-Pacificregion: impact of climate change and socio-environmental factors,
Trop MedInt Health. , vol. 16, 598-607 (2011).[4] Chen, S.C., Hsieh, M.H.: Modeling the transmission dynamics of denguefever: implications of temperature effects,
Science of the Total Environment, vol. 431, pp. 385-391, (2012).[5] Dambach P, Machault V, Lacaux JP, Vignolles C, Sie A, Sauerborn R.:Utilization of combined remote sensing techniques to detect environmentalvariables influencing malaria vector densities in rural West Africa,
Int JHealth Geogr. , vol 11(1476-072X (Electronic)):8-20, (2012).[6] Dayama, P., Kameshwaran, S.: Predicting the dengue incidence in Singaporeusing Univariate Time series models,
Annual Symposium proceedings, AMIA ,285-92, (2013).[7] Gharbi, M., Quenel, P., Gustave, J., Cassadou, S., Ruche, G.L., Girdary, L.,and Marrama, L.: Time series analysis of dengue incidence in Guade- loupe,French West Indies: forecasting models using climate variables as predictors,
BMC infectious diseases, vol. 11, no. 1, p. 166, (2011).[8] Hii YL, Zhu H, Ng N, Ng LC, Rocklov J.: Forecast of Dengue Incidence UsingTemperature and Rainfall,
Plos Neglect Trop Dis. , vol 6:e1908, (2012).[9] Jancloes M, Thomson M, Costa MM, Hewitt C, Corvalan C, Dinku T, LoweR, Hayden M.: Climate Services to Improve Public Health,
Int J EnvironRes Public Health , vol. 11, 4555-4559, (2014).7 denguecases$Month dengue c a s e s $ M on t h l y _ C oun t s as.factor(denguecases$Year) Figure 1: Monthly incidence of dengue vs month across years 2005-20178 . . . . Predictions for the year 2017
Month l og ( I n c i den c e C oun t s ) Actual ValuesPredicted Values
Figure 2: Gaussian model on the test set910] Johnson, L.R., Gramacy , R.B., Cohen, J., Mordecai, E., Murdock, C.,Rohr, J., Ryan, S.J., Stewart-Ibarra, A.M., Weikel, D.:Phenomenologicalforecasting of dengue incidence using heteroskedastic Gaussian processes: adengue study, arXiv 2017 , (2017).[11] Louis, V. R., Phalkey, R., Horstick, O., Ratanawong, P., Wilder-Smith, A.,Tozan, Y., and Dambach, P.: Modeling tools for dengue risk mapping - asystematic review,
International Journal of Health Geographics , vol. 13, no.1, pp. 1-15, (2014).[12] Machault V, Vignolles C, Pag`es F, Gadiaga L, Tourre YM, Gaye A, SokhnaC, Trape J-F, Lacaux J-P, Rogier C.: Risk mapping of Anopheles gambiaes.l. densities using remotely-sensed environmental and meteorological datain an urban area: Dakar, Senegal.
PLoS One , vol 7:e50674, (2012).[13] Naish, S., Dale, P., Mackenzie, J.S., Mengersen, K. and Tong, S.: Cli-mate change and dengue: a critical and systematic review of quantitativemodelling approaches,
BMC infectious diseases , vol. 14, no. 1, p 167, (2014).[14] Nevai AL, Soewono E :, A model for the spatial transmission of dengue withdaily movement between villages and a city,
Math Med Biol vol 30:dqt002,(2013).[15] Rasmussen, C.E. and Williams, C.K.I:
Gaussian processes for machinelearning , The MIT Press, (2006).[16] Sampson, P.D. and Guttorp, P.: Nonparametric estimation of nonstation-ary spatial covariance structure,
Journal of the American Statistical Asso-ciation , vol. 87, no. 417, pp. 108-119, (1992).[17] Syed M, Saleem T, Syeda U-R, Habib M, Zahid R, Bashir A, RabbaniM, Khalid M, Iqbal A, Rao EZ, Shujja-ur-Rehman, Saleem S, Knowledge,attitudes and practices regarding dengue fever among adults of high and lowsocioeconomic groups.
J Pak Med Assoc. , vol.3, 243-7.[18] Thomas CJ, Lindsay SW.: Local-scale variation in malaria infectionamongst rural Gambian children estimated by satellite remote sensing,
TransR Soc Trop Med Hyg. , vol. 94, 159-163., (2000).[19] Troyo A, Fuller DO, Calder´on-Arguedas O, Solano ME, Beier JC: Urbanstructure and dengue fever in Puntarenas, Costa Rica.
Singap J Trop Geogr. ,vol 30, 265-282, (2009).[20] Wilder-Smith A, Gubler DJ.: Geographic expansion of dengue: the impactof international travel.
Med Clin North Am.