Network


Latest external collaboration on country level. Dive into details by clicking on the dots.

Hotspot


Dive into the research topics where Mauricio Santillana is active.

Publication


Featured researches published by Mauricio Santillana.


PLOS Computational Biology | 2015

Combining Search, Social Media, and Traditional Data Sources to Improve Influenza Surveillance

Mauricio Santillana; Andre Nguyen; Mark Dredze; Michael J. Paul; Elaine O. Nsoesie; John S. Brownstein

We present a machine learning-based methodology capable of providing real-time (“nowcast”) and forecast estimates of influenza activity in the US by leveraging data from multiple data sources including: Google searches, Twitter microblogs, nearly real-time hospital visit records, and data from a participatory surveillance system. Our main contribution consists of combining multiple influenza-like illnesses (ILI) activity estimates, generated independently with each data source, into a single prediction of ILI utilizing machine learning ensemble approaches. Our methodology exploits the information in each data source and produces accurate weekly ILI predictions for up to four weeks ahead of the release of CDC’s ILI reports. We evaluate the predictive ability of our ensemble approach during the 2013–2014 (retrospective) and 2014–2015 (live) flu seasons for each of the four weekly time horizons. Our ensemble approach demonstrates several advantages: (1) our ensemble method’s predictions outperform every prediction using each data source independently, (2) our methodology can produce predictions one week ahead of GFT’s real-time estimates with comparable accuracy, and (3) our two and three week forecast estimates have comparable accuracy to real-time predictions using an autoregressive model. Moreover, our results show that considerable insight is gained from incorporating disparate data streams, in the form of social media and crowd sourced data, into influenza predictions in all time horizons.


American Journal of Preventive Medicine | 2014

What Can Digital Disease Detection Learn from (an External Revision to) Google Flu Trends

Mauricio Santillana; D. Wendong Zhang; Benjamin M. Althouse; John W. Ayers

BACKGROUND Google Flu Trends (GFT) claimed to generate real-time, valid predictions of population influenza-like illness (ILI) using search queries, heralding acclaim and replication across public health. However, recent studies have questioned the validity of GFT. PURPOSE To propose an alternative methodology that better realizes the potential of GFT, with collateral value for digital disease detection broadly. METHODS Our alternative method automatically selects specific queries to monitor and autonomously updates the model each week as new information about CDC-reported ILI becomes available, as developed in 2013. Root mean squared errors (RMSEs) and Pearson correlations comparing predicted ILI (proportion of patient visits indicative of ILI) with subsequently observed ILI were used to judge model performance. RESULTS During the height of the H1N1 pandemic (August 2 to December 22, 2009) and the 2012-2013 season (September 30, 2012, to April 12, 2013), GFTs predictions had RMSEs of 0.023 and 0.022 (i.e., hypothetically, if GFT predicted 0.061 ILI one week, it is expected to err by 0.023) and correlations of r=0.916 and 0.927. Our alternative method had RMSEs of 0.006 and 0.009, and correlations of r=0.961 and 0.919 for the same periods. Critically, during these important periods, the alternative method yielded more accurate ILI predictions every week, and was typically more accurate during other influenza seasons. CONCLUSIONS GFT may be inaccurate, but improved methodologic underpinnings can yield accurate predictions. Applying similar methods elsewhere can improve digital disease detection, with broader transparency, improved accuracy, and real-world public health impacts.


Proceedings of the National Academy of Sciences of the United States of America | 2015

Accurate estimation of influenza epidemics using Google search data via ARGO.

Shihao Yang; Mauricio Santillana; S. C. Kou

Accurate real-time tracking of influenza outbreaks helps public health ocials make timely and meaningful decisions that could save lives. We propose a new influenza tracking model, ARGO (AutoRegression with GOogle search data), that uses publicly available online search data. In addition to having a rigorous statistical foundation, ARGO outperforms all previously available tracking models, including the latest version of Google Flu Trends (GFT), even though it uses only low-quality search data as input from publicly available Google Trends and Google Correlate websites. ARGO not only incorporates the seasonality in influenza epidemics, but also captures changes in people’s online search behavior over time. ARGO is also flexible, self-correcting, robust and scalable, making it a potentially powerful tool that can be used for real-time tracking of other social events at multiple temporal and spatial resolutions.Significance Big data generated from the Internet have great potential in tracking and predicting massive social activities. In this article, we focus on tracking influenza epidemics. We propose a model that utilizes publicly available Google search data to estimate current influenza-like illness activity level. Our model outperforms all available Google-search–based real-time tracking models for influenza epidemics at the national level of the United States, including Google Flu Trends. Our model is flexible, self-correcting, robust, and scalable, making it a potentially powerful tool that can be used for estimation and prediction at multiple temporal and spatial resolutions for other social events. Accurate real-time tracking of influenza outbreaks helps public health officials make timely and meaningful decisions that could save lives. We propose an influenza tracking model, ARGO (AutoRegression with GOogle search data), that uses publicly available online search data. In addition to having a rigorous statistical foundation, ARGO outperforms all previously available Google-search–based tracking models, including the latest version of Google Flu Trends, even though it uses only low-quality search data as input from publicly available Google Trends and Google Correlate websites. ARGO not only incorporates the seasonality in influenza epidemics but also captures changes in people’s online search behavior over time. ARGO is also flexible, self-correcting, robust, and scalable, making it a potentially powerful tool that can be used for real-time tracking of other social events at multiple temporal and spatial resolutions.


PLOS Neglected Tropical Diseases | 2014

Evaluation of Internet-Based Dengue Query Data: Google Dengue Trends

Rebecca Tave Gluskin; Michael A. Johansson; Mauricio Santillana; John S. Brownstein

Dengue is a common and growing problem worldwide, with an estimated 70–140 million cases per year. Traditional, healthcare-based, government-implemented dengue surveillance is resource intensive and slow. As global Internet use has increased, novel, Internet-based disease monitoring tools have emerged. Google Dengue Trends (GDT) uses near real-time search query data to create an index of dengue incidence that is a linear proxy for traditional surveillance. Studies have shown that GDT correlates highly with dengue incidence in multiple countries on a large spatial scale. This study addresses the heterogeneity of GDT at smaller spatial scales, assessing its accuracy at the state-level in Mexico and identifying factors that are associated with its accuracy. We used Pearson correlation to estimate the association between GDT and traditional dengue surveillance data for Mexico at the national level and for 17 Mexican states. Nationally, GDT captured approximately 83% of the variability in reported cases over the 9 study years. The correlation between GDT and reported cases varied from state to state, capturing anywhere from 1% of the variability in Baja California to 88% in Chiapas, with higher accuracy in states with higher dengue average annual incidence. A model including annual average maximum temperature, precipitation, and their interaction accounted for 81% of the variability in GDT accuracy between states. This climate model was the best indicator of GDT accuracy, suggesting that GDT works best in areas with intense transmission, particularly where local climate is well suited for transmission. Internet accessibility (average ∼36%) did not appear to affect GDT accuracy. While GDT seems to be a less robust indicator of local transmission in areas of low incidence and unfavorable climate, it may indicate cases among travelers in those areas. Identifying the strengths and limitations of novel surveillance is critical for these types of data to be used to make public health decisions and forecasting models.


Journal of Medical Internet Research | 2014

A Case Study of the New York City 2012-2013 Influenza Season With Daily Geocoded Twitter Data From Temporal and Spatiotemporal Perspectives

Ruchit Nagar; Qingyu Yuan; Clark C. Freifeld; Mauricio Santillana; Aaron Nojima; Rumi Chunara; John S. Brownstein

Background Twitter has shown some usefulness in predicting influenza cases on a weekly basis in multiple countries and on different geographic scales. Recently, Broniatowski and colleagues suggested Twitter’s relevance at the city-level for New York City. Here, we look to dive deeper into the case of New York City by analyzing daily Twitter data from temporal and spatiotemporal perspectives. Also, through manual coding of all tweets, we look to gain qualitative insights that can help direct future automated searches. Objective The intent of the study was first to validate the temporal predictive strength of daily Twitter data for influenza-like illness emergency department (ILI-ED) visits during the New York City 2012-2013 influenza season against other available and established datasets (Google search query, or GSQ), and second, to examine the spatial distribution and the spread of geocoded tweets as proxies for potential cases. Methods From the Twitter Streaming API, 2972 tweets were collected in the New York City region matching the keywords “flu”, “influenza”, “gripe”, and “high fever”. The tweets were categorized according to the scheme developed by Lamb et al. A new fourth category was added as an evaluator guess for the probability of the subject(s) being sick to account for strength of confidence in the validity of the statement. Temporal correlations were made for tweets against daily ILI-ED visits and daily GSQ volume. The best models were used for linear regression for forecasting ILI visits. A weighted, retrospective Poisson model with SaTScan software (n=1484), and vector map were used for spatiotemporal analysis. Results Infection-related tweets (R=.763) correlated better than GSQ time series (R=.683) for the same keywords and had a lower mean average percent error (8.4 vs 11.8) for ILI-ED visit prediction in January, the most volatile month of flu. SaTScan identified primary outbreak cluster of high-probability infection tweets with a 2.74 relative risk ratio compared to medium-probability infection tweets at P=.001 in Northern Brooklyn, in a radius that includes Barclay’s Center and the Atlantic Avenue Terminal. Conclusions While others have looked at weekly regional tweets, this study is the first to stress test Twitter for daily city-level data for New York City. Extraction of personal testimonies of infection-related tweets suggests Twitter’s strength both qualitatively and quantitatively for ILI-ED prediction compared to alternative daily datasets mixed with awareness-based data such as GSQ. Additionally, granular Twitter data provide important spatiotemporal insights. A tweet vector-map may be useful for visualization of city-level spread when local gold standard data are otherwise unavailable.


American Journal of Public Health | 2015

Flu near you: Crowdsourced symptom reporting spanning 2 influenza seasons

Mark S. Smolinski; Adam W. Crawley; Kristin Baltrusaitis; Rumi Chunara; Jennifer M. Olsen; Oktawia Wojcik; Mauricio Santillana; Andre Nguyen; John S. Brownstein

OBJECTIVES We summarized Flu Near You (FNY) data from the 2012-2013 and 2013-2014 influenza seasons in the United States. METHODS FNY collects limited demographic characteristic information upon registration, and prompts users each Monday to report symptoms of influenza-like illness (ILI) experienced during the previous week. We calculated the descriptive statistics and rates of ILI for the 2012-2013 and 2013-2014 seasons. We compared raw and noise-filtered ILI rates with ILI rates from the Centers for Disease Control and Prevention ILINet surveillance system. RESULTS More than 61 000 participants submitted at least 1 report during the 2012-2013 season, totaling 327 773 reports. Nearly 40 000 participants submitted at least 1 report during the 2013-2014 season, totaling 336 933 reports. Rates of ILI as reported by FNY tracked closely with ILINet in both timing and magnitude. CONCLUSIONS With increased participation, FNY has the potential to serve as a viable complement to existing outpatient, hospital-based, and laboratory surveillance systems. Although many established systems have the benefits of specificity and credibility, participatory systems offer advantages in the areas of speed, sensitivity, and scalability.


JMIR public health and surveillance | 2016

Utilizing Nontraditional Data Sources for Near Real-Time Estimation of Transmission Dynamics During the 2015-2016 Colombian Zika Virus Disease Outbreak

Maimuna S. Majumder; Mauricio Santillana; Sumiko R. Mekaru; Denise P. McGinnis; Kamran Khan; John S. Brownstein

Background Approximately 40 countries in Central and South America have experienced local vector-born transmission of Zika virus, resulting in nearly 300,000 total reported cases of Zika virus disease to date. Of the cases that have sought care thus far in the region, more than 70,000 have been reported out of Colombia. Objective In this paper, we use nontraditional digital disease surveillance data via HealthMap and Google Trends to develop near real-time estimates for the basic (R0) and observed (Robs) reproductive numbers associated with Zika virus disease in Colombia. We then validate our results against traditional health care-based disease surveillance data. Methods Cumulative reported case counts of Zika virus disease in Colombia were acquired via the HealthMap digital disease surveillance system. Linear smoothing was conducted to adjust the shape of the HealthMap cumulative case curve using Google search data. Traditional surveillance data on Zika virus disease were obtained from weekly Instituto Nacional de Salud (INS) epidemiological bulletin publications. The Incidence Decay and Exponential Adjustment (IDEA) model was used to estimate R0 and Robs for both data sources. Results Using the digital (smoothed HealthMap) data, we estimated a mean R0 of 2.56 (range 1.42-3.83) and a mean Robs of 1.80 (range 1.42-2.30). The traditional (INS) data yielded a mean R0 of 4.82 (range 2.34-8.32) and a mean Robs of 2.34 (range 1.60-3.31). Conclusions Although modeling using the traditional (INS) data yielded higher R0 estimates than the digital (smoothed HealthMap) data, modeled ranges for Robs were comparable across both data sources. As a result, the narrow range of possible case projections generated by the traditional (INS) data was largely encompassed by the wider range produced by the digital (smoothed HealthMap) data. Thus, in the absence of traditional surveillance data, digital surveillance data can yield similar estimates for key transmission parameters and should be utilized in other Zika virus-affected countries to assess outbreak dynamics in near real time.


PLOS Neglected Tropical Diseases | 2017

Forecasting Zika Incidence in the 2016 Latin America Outbreak Combining Traditional Disease Surveillance with Search, Social Media, and News Report Data

Sarah McGough; John S. Brownstein; Jared B. Hawkins; Mauricio Santillana

Background Over 400,000 people across the Americas are thought to have been infected with Zika virus as a consequence of the 2015–2016 Latin American outbreak. Official government-led case count data in Latin America are typically delayed by several weeks, making it difficult to track the disease in a timely manner. Thus, timely disease tracking systems are needed to design and assess interventions to mitigate disease transmission. Methodology/Principal Findings We combined information from Zika-related Google searches, Twitter microblogs, and the HealthMap digital surveillance system with historical Zika suspected case counts to track and predict estimates of suspected weekly Zika cases during the 2015–2016 Latin American outbreak, up to three weeks ahead of the publication of official case data. We evaluated the predictive power of these data and used a dynamic multivariable approach to retrospectively produce predictions of weekly suspected cases for five countries: Colombia, El Salvador, Honduras, Venezuela, and Martinique. Models that combined Google (and Twitter data where available) with autoregressive information showed the best out-of-sample predictive accuracy for 1-week ahead predictions, whereas models that used only Google and Twitter typically performed best for 2- and 3-week ahead predictions. Significance Given the significant delay in the release of official government-reported Zika case counts, we show that these Internet-based data streams can be used as timely and complementary ways to assess the dynamics of the outbreak.


Scientific Reports | 2016

Evaluating the performance of infectious disease forecasts: A comparison of climate-driven and seasonal dengue forecasts for Mexico

Michael A. Johansson; Nicholas G. Reich; Aditi Hota; John S. Brownstein; Mauricio Santillana

Dengue viruses, which infect millions of people per year worldwide, cause large epidemics that strain healthcare systems. Despite diverse efforts to develop forecasting tools including autoregressive time series, climate-driven statistical, and mechanistic biological models, little work has been done to understand the contribution of different components to improved prediction. We developed a framework to assess and compare dengue forecasts produced from different types of models and evaluated the performance of seasonal autoregressive models with and without climate variables for forecasting dengue incidence in Mexico. Climate data did not significantly improve the predictive power of seasonal autoregressive models. Short-term and seasonal autocorrelation were key to improving short-term and long-term forecasts, respectively. Seasonal autoregressive models captured a substantial amount of dengue variability, but better models are needed to improve dengue forecasting. This framework contributes to the sparse literature of infectious disease prediction model evaluation, using state-of-the-art validation techniques such as out-of-sample testing and comparison to an appropriate reference model.


PLOS Currents | 2015

2014 ebola outbreak: media events track changes in observed reproductive number.

Maimuna S. Majumder; Sheryl A. Kluberg; Mauricio Santillana; Sumiko R. Mekaru; John S. Brownstein

In this commentary, we consider the relationship between early outbreak changes in the observed reproductive number of Ebola in West Africa and various media reported interventions and aggravating events. We find that media reports of interventions that provided education, minimized contact, or strengthened healthcare were typically followed by sustained transmission reductions in both Sierra Leone and Liberia. Meanwhile, media reports of aggravating events generally preceded temporary transmission increases in both countries. Given these preliminary findings, we conclude that media reported events could potentially be incorporated into future epidemic modeling efforts to improve mid-outbreak case projections.

Collaboration


Dive into the Mauricio Santillana's collaboration.

Top Co-Authors

Avatar
Top Co-Authors

Avatar
Top Co-Authors

Avatar

Craig D Smallwood

Boston Children's Hospital

View shared research outputs
Top Co-Authors

Avatar

John H. Arnold

Boston Children's Hospital

View shared research outputs
Top Co-Authors

Avatar
Top Co-Authors

Avatar

Brian K Walsh

Boston Children's Hospital

View shared research outputs
Top Co-Authors

Avatar

Clint Dawson

University of Texas at Austin

View shared research outputs
Top Co-Authors

Avatar
Top Co-Authors

Avatar

Fred Lu

Boston Children's Hospital

View shared research outputs
Top Co-Authors

Avatar

Maimuna S. Majumder

Massachusetts Institute of Technology

View shared research outputs
Researchain Logo
Decentralizing Knowledge