Flu Detector: Estimating influenza-like illness rates from online user-generated content
Vasileios Lampos
Department of Computer Science, University College London
[email protected]
Flu Detector’s version: v.0.5
Published on:
December 11, 2016
Abstract
We provide a brief technical description of an online platform for disease monitoring, titled Flu Detector (fludetector.cs.ucl.ac.uk). Flu Detector, in its current version (v.0.5), uses either Twitter or Google search data in conjunction with statistical Natural Language Processing models to estimate the rate of influenza-like illness in the population of England. Its back-end is a live service that collects online data, utilises modern technologies for large-scale text processing, and finally applies statistical inference models that are trained offline. The front-end visualises the various disease rate estimates. Notably, the models based on Google data achieve a high level of accuracy with respect to the most recent four flu seasons in England (2012/13 to 2015/16). This highlights Flu Detector's potential to become a complementary source to the domestic traditional flu surveillance schemes.

Introduction

Information epidemiology, or 'infodemiology' (Eysenbach, 2009), is evidently not a hypothesis anymore. Numerous research efforts in recent years have provided proof that user-generated data, especially in the form of search queries or social media, can be used to better understand a multi-faceted collection of health issues. Within this rapidly developing field of research, usually labelled as Computational Health, one of the most prominent examples has been the modelling of influenza-like illness (ILI) rates (Polgreen et al., 2008; Ginsberg et al., 2009; Lampos and Cristianini, 2010; Culotta, 2010; Paul and Dredze, 2011; Signorini et al., 2011). Attempting to translate research results into an actual application, the platform of Google Flu Trends (GFT) was launched in 2008, based on a method proposed by Ginsberg et al. (2009) for mapping the frequency of search queries to ILI rates in the US. In 2010, Lampos et al. developed the first tool that used social media content to estimate ILI rates in the UK (Lampos et al., 2010).
The Flu Detector of that era used Twitter posts and basic supervised learning models, such as the 'lasso' (Tibshirani, 1996; Lampos and Cristianini, 2010) or its bootstrapped version (Bach, 2008; Lampos and Cristianini, 2012), operating on Bag-of-Words representations of the data. Naturally, there was space for further improvements, something that has been explored in various follow-up works (e.g. by Lamb et al. (2013) or Preis and Moat (2014)). In late 2015, amidst severe criticism (Olson et al., 2013; Lazer et al., 2014) and bad press due to significant mispredictions in the past flu seasons, the GFT service was unfortunately discontinued. Advancements in statistical Natural Language Processing (NLP) combined with a better understanding of the problem have recently led to disease models that overcome past deficiencies (Lampos et al., 2015b; Lampos et al., 2015a; Yang et al., 2015). Motivated by this fact, a revamped version of
Flu Detector (fludetector.cs.ucl.ac.uk) that has access to both Twitter and Google search data has been developed and recently launched. Given that GFT (see google.org/flutrends) never made ILI rate estimates for England (or the UK), Flu Detector embodies the first online tool making ILI rate estimations for England based on Google search data. The last working snapshot of Flu Detector's predecessor (circa March 2013) is hosted under twitter.lampos.net/epidemics.

[Figure 1: Flu Detector's weekly ILI estimates for the 2015/16 flu season in England based on Google search data. They are compared to the RCGP ILI rates as released by PHE.]

To ensure that Flu Detector will not be a one-off scientific outcome, but will have a practical impact, the inference accuracy as well as the potential added value of the tool to the current (traditional) health surveillance schemes have been assessed in collaboration with Public Health England (PHE), the leading governmental agency responsible for the national health surveillance schemes. The results of the evaluation, which will be published separately, are positive, leading to a potential incorporation of Flu Detector's estimates as a complementary indicator in the weekly flu surveillance reports during the coming flu seasons.

This document summarises the main functionalities of Flu Detector. It should be considered as an ongoing reference to the online tool and, as such, it will be updated as new modules are being launched.
Data

The current version of Flu Detector has access to two online user-generated content sources, namely Twitter and Google search. The supervised models of ILI for England are trained on syndromic surveillance data.
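As a rough illustration of how such a supervised model can be trained (a hypothetical, minimal sketch with synthetic data and illustrative names, not Flu Detector's actual code), an L1-regularised linear regression ('lasso', as used in the earlier Twitter work cited above) maps normalised term frequencies to ILI rates while selecting a sparse subset of predictive terms:

```python
import numpy as np

def lasso_cd(X, y, alpha=0.01, n_iter=100):
    """Coordinate-descent lasso: minimise (1/2n)||y - Xw||^2 + alpha*||w||_1."""
    n, p = X.shape
    w = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n            # per-feature curvature terms
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ w + X[:, j] * w[j]     # residual with feature j removed
            rho = X[:, j] @ r_j / n
            # soft-thresholding drives most weights exactly to zero
            w[j] = np.sign(rho) * max(abs(rho) - alpha, 0.0) / col_sq[j]
    return w

# Synthetic stand-in data: 52 weeks, 100 candidate n-gram/query frequencies.
rng = np.random.default_rng(0)
X = rng.random((52, 100))                        # normalised term frequencies
w_true = np.zeros(100)
w_true[:5] = [3.0, 2.0, 1.5, 1.0, 0.5]           # only 5 terms truly predictive
y = X @ w_true + 0.01 * rng.standard_normal(52)  # synthetic weekly ILI rates

w = lasso_cd(X, y)
print("terms selected:", np.count_nonzero(np.abs(w) > 1e-8), "of", X.shape[1])
```

The sparsity induced by the L1 penalty is what makes the approach interpretable: the surviving terms can be inspected to confirm that they are flu-related.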
We collect almost every exactly geolocated tweet in England using Twitter's Streaming API (dev.twitter.com/streaming/overview). By "exact geolocation" we refer to tweets where the geo-coordinates (latitude and longitude) of the user who posted them are available. This amounts to an average of approximately , tweets per day. We note that this number is relatively small as, according to our estimates, it represents only - of the entire set of tweets from users in England. Hence, the ILI rate inferences based on Twitter data are inevitably unstable.

Flu Detector has access to a non-standardised version of the publicly available Google Trends outputs through a private Google Health Trends API, which can only be used for academic research with a health-oriented focus. This provides (aggregate and anonymised) normalised frequencies of search queries. More specifically, a query frequency expresses the probability of a short search session for a specific geographical region and temporal resolution, drawn from a uniformly distributed - sample of all corresponding search sessions.

At the moment, Flu Detector models ILI rates as reported by the Royal College of General Practitioners (RCGP) and PHE (gov.uk/government/organisations/public-health-england). The estimates represent the number of doctor consultations reporting ILI symptoms per , people in England.

Methods

Supervised learning techniques are used to model flu rates from Twitter or Google search data. A selection of papers has served as motivation for the actual methods that are employed on the website, from early papers on the topic (Ginsberg et al., 2009; Lampos and Cristianini, 2010) to more recent developments (Lampos et al., 2015b; Lampos et al., 2015a; Zou et al., 2016). The applied methods combine these different pieces of knowledge with advancements in statistical NLP (e.g.
the use of neural word embeddings (Mikolov et al., 2013a; Mikolov et al., 2013b)) and, at the moment, are being documented.

As a preliminary performance indicator of the Google search based model, the average Mean Absolute Error in year-long weekly ILI rate estimates across four flu seasons (from 2012/13 to 2015/16) is approximately equal to . (in , people) compared to the corresponding RCGP ILI rates (see gov.uk/government/statistics/weekly-national-flu-reports); the corresponding average Pearson correlation is equal to . . Extensive performance evaluation will become available in forthcoming publications.

At the back-end of Flu Detector, there is a software pipeline for data collection, storage and processing. The latter uses standard Python libraries (e.g. gensim, nltk, numpy, scipy) and the Apache Hadoop framework (hadoop.apache.org) for task parallelisation. Textual data can be manually processed in batches (e.g. for model training). In addition, the frequency of the textual variables used in Flu Detector's models is automatically updated on a daily basis.

The ILI estimation models, which are trained offline, are used to produce daily (over-night) inferences as well as weekly ones. To maintain consistency with the data distributions during the model training phase, where only weekly ILI rates are available, each estimate on Flu Detector (even the daily ones) uses a week-long set of observations. For example, to estimate the ILI rate of date i, we use the frequencies of textual terms during the dates {i, i−1, ..., i−6} for the target dataset.
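The week-long observation window can be sketched as follows (a hypothetical illustration with synthetic data; the variable names and the simple averaging over the window are assumptions, not Flu Detector's implementation). Each daily estimate is built from the term frequencies of the dates {i, i−1, ..., i−6}, and candidate models can be compared to reference rates via the Mean Absolute Error and Pearson correlation mentioned above:

```python
import numpy as np

def week_window_features(daily_freqs):
    """daily_freqs: (n_days, n_terms) array of daily term frequencies.
    Returns one feature vector per day from day 7 onward: the mean
    frequency over the dates {i, i-1, ..., i-6}."""
    return np.stack([daily_freqs[i - 6 : i + 1].mean(axis=0)
                     for i in range(6, daily_freqs.shape[0])])

def mae(y_true, y_est):
    """Mean Absolute Error, in the same units as the ILI rate."""
    return float(np.mean(np.abs(y_true - y_est)))

def pearson_r(y_true, y_est):
    """Pearson correlation between reference and estimated rates."""
    return float(np.corrcoef(y_true, y_est)[0, 1])

rng = np.random.default_rng(1)
daily = rng.random((30, 50))             # 30 days x 50 term frequencies (synthetic)
weights = rng.standard_normal(50) * 0.1  # stand-in for an offline-trained model
X = week_window_features(daily)          # one feature vector per day from day 7
estimates = X @ weights                  # daily ILI rate estimates
```

Keeping the inference features at the same week-long resolution as the training targets avoids a train/test mismatch in the feature distributions, which is the design choice the paragraph above describes.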
For Twitter-driven estimates, which are consequently based on a small portion of data and tend to be noisy, the user of the website can also access smoothed versions of the inferred time series.

The current version of Flu Detector incorporates five Twitter-based models, one focusing on England as a whole and four on sub-regions ('London', 'North England', 'South England', 'Midlands and East England'). As expected, the regional models are very unstable given the even smaller data ratio that characterises them. Moreover, the platform has a Google search model for England only (regional Google search data have not yet been made available). Given the higher penetration of Google search in the real population, as well as the significantly larger sample of searches that are used to compute search query frequencies ( - ), the corresponding estimates are much more reliable.

Apart from its public interface, Flu Detector also has an internal one, used for testing new modules and evaluating estimates compared to traditional syndromic surveillance schemes (see Fig. 1).

Conclusions

In this brief report, we introduced Flu Detector, an online tool for presenting disease rate estimates based on user-generated content. The current version of Flu Detector uses data from Google search or Twitter and displays ILI rate estimates for England. This report will be updated as new functionalities are being launched.

Future work includes the consideration of different infectious diseases, the incorporation of more data sources, as well as the development of unsupervised disease modelling schemes. Stratified disease estimates based on perceived user demographics, e.g. age (Rao et al., 2010), occupation or socioeconomic status (Preoţiuc-Pietro et al., 2015a; Preoţiuc-Pietro et al., 2015b; Lampos et al., 2016), as well as the expansion of models so as to cover different countries, are among our priorities.
Acknowledgements
Flu Detector is funded by the EPSRC IRC project EP/K031953/1 (i-sense, i-sense.org.uk) and by a Google Research sponsorship. V. Lampos would like to thank all the people involved in the various stages of development of Flu Detector and the underlying methods, and in particular I.J. Cox, A.C. Miller, J.K. Geyti, B. Zou, M. Wagner and R. Pebody. He would also like to thank PHE, the RCGP and Google for providing data. Credit should also be given to N. Cristianini and T. De Bie, who participated in the development of Flu Detector's predecessor (Lampos et al., 2010).
References
Francis R. Bach. 2008. Bolasso: Model Consistent Lasso Estimation Through the Bootstrap. In Proc. of the 25th International Conference on Machine Learning, pages 33–40.
Aron Culotta. 2010. Towards Detecting Influenza Epidemics by Analyzing Twitter Messages. In Proc. of the 1st Workshop on Social Media Analytics, pages 115–122.
Gunther Eysenbach. 2009. Infodemiology and Infoveillance: Framework for an Emerging Set of Public Health Informatics Methods to Analyze Search, Communication and Publication Behavior on the Internet. Journal of Medical Internet Research, 11(1):e11.
Jeremy Ginsberg, Matthew H. Mohebbi, Rajan S. Patel, et al. 2009. Detecting influenza epidemics using search engine query data. Nature, 457(7232):1012–1014.
Alex Lamb, Michael J. Paul, and Mark Dredze. 2013. Separating Fact from Fear: Tracking Flu Infections on Twitter. In Proc. of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: HLT, pages 789–795.
Vasileios Lampos and Nello Cristianini. 2010. Tracking the flu pandemic by monitoring the Social Web. In Proc. of the 2nd International Workshop on Cognitive Information Processing, pages 411–416.
Vasileios Lampos and Nello Cristianini. 2012. Nowcasting Events from the Social Web with Statistical Learning. ACM Transactions on Intelligent Systems and Technology, 3(4):1–22.
Vasileios Lampos, Tijl De Bie, and Nello Cristianini. 2010. Flu Detector: Tracking Epidemics on Twitter. In Proc. of the 2010 European Conference on Machine Learning and Knowledge Discovery in Databases, pages 599–602.
Vasileios Lampos, Andrew C. Miller, Steve Crossan, and Christian Stefansen. 2015a. Advances in nowcasting influenza-like illness rates using search query logs. Scientific Reports, 5(12760).
Vasileios Lampos, Elad Yom-Tov, Richard Pebody, and Ingemar J. Cox. 2015b. Assessing the impact of a health intervention via user-generated Internet content. Data Mining and Knowledge Discovery, 29(5):1434–1457.
Vasileios Lampos, Nikolaos Aletras, Jens K. Geyti, Bin Zou, and Ingemar J. Cox. 2016. Inferring the Socioeconomic Status of Social Media Users Based on Behaviour and Language. In Proc. of the 38th European Conference on IR Research, pages 689–695.
David Lazer, Ryan Kennedy, Gary King, and Alessandro Vespignani. 2014. The Parable of Google Flu: Traps in Big Data Analysis. Science, 343(6176):1203–1205.
Tomas Mikolov, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013a. Efficient Estimation of Word Representations in Vector Space. In Proc. of the International Conference on Learning Representations, Workshop Track, pages 1–12.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013b. Distributed Representations of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems 26, pages 3111–3119.
Donald R. Olson, Kevin J. Konty, Marc Paladini, Cecile Viboud, and Lone Simonsen. 2013. Reassessing Google Flu Trends Data for Detection of Seasonal and Pandemic Influenza: A Comparative Epidemiological Study at Three Geographic Scales. PLOS Computational Biology, 9(10).
Michael J. Paul and Mark Dredze. 2011. You Are What You Tweet: Analyzing Twitter for Public Health. In Proc. of the 5th International Conference on Weblogs and Social Media, pages 265–272.
Philip M. Polgreen, Yiling Chen, David M. Pennock, Forrest D. Nelson, and Robert A. Weinstein. 2008. Using Internet Searches for Influenza Surveillance. Clinical Infectious Diseases, 47(11):1443–1448.
Tobias Preis and Helen Susannah Moat. 2014. Adaptive nowcasting of influenza outbreaks using Google searches. Open Science, 1(2).
Daniel Preoţiuc-Pietro, Vasileios Lampos, and Nikolaos Aletras. 2015a. An analysis of the user occupational class through Twitter content. In Proc. of the 53rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1754–1764.
Daniel Preoţiuc-Pietro, Svitlana Volkova, Vasileios Lampos, Yoram Bachrach, and Nikolaos Aletras. 2015b. Studying User Income through Language, Behaviour and Affect in Social Media. PLOS ONE, 10(9).
Delip Rao, David Yarowsky, Abhishek Shreevats, and Manaswi Gupta. 2010. Classifying Latent User Attributes in Twitter. In Proc. of the 2nd International Workshop on Search and Mining User-generated Contents, pages 37–44.
Alessio Signorini, Alberto Maria Segre, and Philip M. Polgreen. 2011. The Use of Twitter to Track Levels of Disease Activity and Public Concern in the U.S. during the Influenza A H1N1 Pandemic. PLOS ONE, 6(5).
Robert Tibshirani. 1996. Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 58(1):267–288.
Shihao Yang, Mauricio Santillana, and Samuel C. Kou. 2015. Accurate estimation of influenza epidemics using Google search data via ARGO. Proceedings of the National Academy of Sciences, 112(47):14473–14478.
Bin Zou, Vasileios Lampos, Russell Gorton, and Ingemar J. Cox. 2016. On Infectious Intestinal Disease Surveillance Using Social Media Content. In