[PDF] A machine learning methodology for real-time forecasting of the 2019-2020 COVID-19 outbreak using Internet searches, news alerts, and estimates from mechanistic models

Abstract

We present a timely and novel methodology that combines disease estimates from mechanistic models with digital traces, via interpretable machine-learning methodologies, to reliably forecast COVID-19 activity in Chinese provinces in real-time. Specifically, our method is able to produce stable and accurate forecasts 2 days ahead of current time, and uses as inputs (a) official health reports from Chinese Center Disease for Control and Prevention (China CDC), (b) COVID-19-related internet search activity from Baidu, (c) news media activity reported by Media Cloud, and (d) daily forecasts of COVID-19 activity from GLEAM, an agent-based mechanistic model. Our machine-learning methodology uses a clustering technique that enables the exploitation of geo-spatial synchronicities of COVID-19 activity across Chinese provinces, and a data augmentation technique to deal with the small number of historical disease activity observations, characteristic of emerging outbreaks. Our model's predictive power outperforms a collection of baseline models in 27 out of the 32 Chinese provinces, and could be easily extended to other geographies currently affected by the COVID-19 outbreak to help decision makers.

Full PDF

AA machine learning methodology for real-timeforecasting of the 2019-2020 COVID-19 outbreakusing Internet searches, news alerts, and estimatesfrom mechanistic models

Dianbo Liu † , , , Leonardo Clemente † , , , , Canelle Poirier † , , ,Xiyu Ding , , Matteo Chinazzi , Jessica T Davis , Alessandro Vespignani , ,Mauricio Santillana ∗ , , , Computational Health Informatics Program, Boston Children’s Hospital, Boston MA 02215 Department of Pediatrics, Harvard Medical School, Boston MA 02215 Tecnolgico de Monterrey, 64849, Monterrey, N.L., Mexico Harvard T.H. Chan School of Public Health, Boston MA 02215 Laboratory for the Modeling of Biological and Socio-technical Systems,

Northeastern University, Boston, MA USA ISI Foundation, Turin, Italy ∗ Corresponding author: Mauricio Santillana, [email protected] † These authors contributed equally to this paper.

Abstract

We present a timely and novel methodology that combines disease estimates from mech-anistic models with digital traces, via interpretable machine-learning methodologies, toreliably forecast COVID-19 activity in Chinese provinces in real-time. Speciﬁcally, ourmethod is able to produce stable and accurate forecasts 2 days ahead of current time, anduses as inputs (a) ofﬁcial health reports from Chinese Center Disease for Control and Pre-vention (China CDC), (b) COVID-19-related internet search activity from Baidu, (c) newsmedia activity reported by Media Cloud, and (d) daily forecasts of COVID-19 activityfrom GLEAM, an agent-based mechanistic model. Our machine-learning methodologyuses a clustering technique that enables the exploitation of geo-spatial synchronicities ofCOVID-19 activity across Chinese provinces, and a data augmentation technique to dealwith the small number of historical disease activity observations, characteristic of emerg-ing outbreaks. Our model’s predictive power outperforms a collection of baseline models a r X i v : . [ s t a t . O T ] A p r n 27 out of the 32 Chinese provinces, and could be easily extended to other geographiescurrently affected by the COVID-19 outbreak to help decision makers. Introduction

First detected in Wuhan, China, in December 2019, the SARS-CoV-2 virus had rapidly spreadby late January 2020 to all Chinese provinces and many other countries ( ). On January 30,2020, the World Health Organization (WHO) issued a Public Health Emergency of InternationalConcern (PHEIC) ( ); and on March 11th, the WHO declared a pandemic for the CoronavirusDisease (COVID-19) ( ). By April 6, 2020, the virus had affected more than 1,200,000 peopleand caused the deaths of 70,000 in more than 180 countries. ( ).In the last decade, methods that leverage data from Internet-based data sources and data fromtraditional surveillance systems have emerged as a complementary alternative to provide nearreal-time disease activity (e.g. Inﬂuenza, dengue) estimates ( ). Despite the fact that thesemethodologies have successfully addressed delays in the availability of health reports as well ascase count data quality issues, developing predictive models for an emerging disease outbreaksuch as COVID-19 is an even more challenging task ( ). There are multiple reasons for this,for example, the availability of epidemiological information for this disease is scarce (there isno historical precedent about the behavior of the disease), the daily/weekly epidemiological re-ports that become available are frequently revised and corrected retrospectively to account formistakes in data collection and reporting (a common practice in public health reports), and thepresence of a diverse array of uncertainties about disease burden due in part to underreportingof cases ( ).Most efforts to estimate the time evolution of COVID-19 spread and the effect of public healthinterventions have relied on mechanistic models that parameterize transmission and epidemio-logical characteristics to produce forecasts of disease activity (

16, 17 ). In contrast, few studieshave investigated ways to track COVID-19 activity, leveraging internet search data (

1, 13, 18 ),and none to the best of our knowledge have combined Internet-based data sources and mecha-nistic estimates to forecast COVID-19 activity.

Our contribution : We present a novel methodology that combines mechanistic and machine-learning methodologies to successfully forecast COVID-19 in real-time at the province levelin China. Augmented ARGONet uses a data-driven approach to incorporate inputs from (a)ofﬁcial health reports from China CDC, (b) COVID-19-related internet search activity fromBaidu, (c) news media activity reported by Media Cloud, and (d) daily forecasts of COVID-19 activity from GLEAM, an agent-based mechanistic model ( ). Inspired by a methodol-ogy previously used to successfully forecast seasonal inﬂuenza in the United States at the statelevel ( ) and previous methods to monitor emerging outbreaks ( ), Augmented ARGONet2s capable of reliably forecasting COVID-19 activity even when limited historical disease activ-ity observations are available. From a methodological perspective, the novelty in our approachcomes from a clustering technique that enables the exploitation of geo-spatial synchronicities ofCOVID-19’s activity across Chinese provinces, and a data augmentation technique to mitigatethe scarcity of historical data for model-training. Results

We produced 2-day ahead (strictly out-of-sample) and real-time COVID-19 forecasts for 32Chinese provinces, for the time period spanning February 3, 2020 to February 21, 2020. Inother words, our study was not retrospective, it was conducted as the outbreak unfolded. Theforecasting (case aggregation) time horizon of our predictions was chosen to be 2 days to en-hance the signal to noise ratio. Based on geo-spatial synchronicities in COVID-19 activity ineach province during the (dynamically increasing) training period, groups of Chinese provinceswere clustered, every two days, to train a separate machine learning model per cluster to pro-duce predictions. COVID-19 historical information from ofﬁcial health reports, Baidu searchengine frequencies, news reports from Media Cloud, and mechanistic model estimates werethen included into separate datasets for each cluster of provinces. Additionally, a data augmen-tation procedure aimed at increasing the size of each cluster’s training dataset was implementedto deal with the scarce number of observations during this relatively-short (local) outbreak. Avisual representation of our out-of-sample model forecasts is shown in Figure 1 along with thesubsequently observed COVID-19 cases, as reported by China CDC.To quantitatively evaluate the predictive power of Augmented ARGONet, we compared ourforecasts with a collection of baseline forecasts obtained with (a) a “persistence model” thatforecasts that the number of new COVID-19 cases (in the next two days) will exactly be thenumber of cases observed in the past two days (used as the baseline for all models), (b) anautoregressive model that only uses historical COVID-19 activity as input, and (c) an enhancedversion of ARGONet, without the incorporation of mechanistic model outputs (see details inMaterials and Methods section). Our results show that Augmented ARGONet, outperforms thepersistence model in 27 out of the 32 Chinese provinces. Even in provinces where AugmentedARGONet failed to produce improvements to the baseline model, our model produced reason-able disease estimates as seen in Figure 1. These provinces include Shanxi, Liaoning, Taiwan,Hong Kong and Guangxi; the latter three with very different administration (and likely health-care) systems compared to the rest of the provinces.

Models built with only local data do not produce signiﬁcantly better predictions than thebaseline.

We analyzed the performance of models built using only local province-level epi-demiological data as input. We generated an autoregressive (AR) model for each province, builton COVID-19 cases that occurred in the previous 3 autoregressive lags (i.e. the previous 3,3-day reports), and compared our estimates with the baseline. Our results, presented in Figure2 labeled AR, show that the autoregressive model predictive power was overall inferior to thebaseline’s performance, with exception to Jilin, Tianjin, Hebei, Hubei and Heilongjiang. Sub-sequently, we incorporated local disease-related internet search information from Baidu andnews alert data from media cloud as inputs to build ARGO-type models ( ). These ARGO-typemodels showed marginal predictive power improvements when compared with autoregressivemodels (AR) and only outperformed the baseline in 7 provinces. Dynamic clustering of Chinese provinces.

Based on prior work on Inﬂuenza activity predic-tion ( ), we added historical COVID-19 activity information for all Chinese provinces to theinput of our local models. We calculated the pairwise correlation matrix for conﬁrmed COVID-19 cases between all Chinese provinces, between February 1 and February 21 2020 (Figure S1).Our results showed that most of the provinces experienced similar epidemic trends. To buildour (clustered) predictive models we combined the data available from several provinces withsimilar trends (in terms of correlation, which was strictly calculated within our training periodat the time-step of prediction). The clustering modeling approach, which incorporated Internet-based data sources as the ARGO-type models, produced forecasts that led to error reductionsfor 17 out of 32 provinces compared to the persistence model, and improved correlation valuesin 20 out of 32 provinces. Data augmentation.

As an additional way to increase the number of observations in the train-ing set of each cluster, we implemented a data augmentation technique. This process consistedof generating new observations via a Bootstrap method and addition of random Gaussian noise( (cid:15) ∼ N (0 , . ) to every randomly selected observation.The results of incorporating both clustering and augmentation techniques can be seen in Figure2. For simplicity, we labeled these predictions ARGONet, even though this implementationof ARGONet is an enhanced version speciﬁcally designed for emerging outbreaks where datais scarce. In terms of RMSE, our results show that ARGONet’s predictive power was able tooutperform AR and the persistence model in 25 of the 32 Chinese provinces. In terms of corre-lation, ARGONet outperformed the baseline (persistence) model in 18 provinces. Augmented ARGONet.

We included forecasts produced by mechanistic model as an addi-tional input in our models (prior to the clustering and augmentation steps). The results ofincorporating these estimates can be seen in Figure 2 with the name of Augmented ARGONet.Our results show that the inclusion of mechanistic model estimates improved ARGONet’s pre-dictive power across the majority of provinces. Augmented ARGONet led to error reductions in27 out of 32 provinces compared to the baseline. In terms of correlation, it improved in 26 outof 32 provinces. Provinces like Qinghai, Hunan and Jiangxi showed the biggest improvement,whereas Taiwan, Hong Kong, Shanxi and Liaoning did not display error reductions.4s an alternative way to visualize Augmented ARGONet’s predictive performance, we plotted amap with Chinese provinces (Figure 3) color-coded based on the improvement shown in Figure2. From a geographical perspective, the provinces where Augmented ARGONet had the mostimprovement (Anhui, Jiangxi, Fujian, Sichuan, Guangdong) were located in South Central ofChina. Shanxi, Liaoning, Taiwan, Hong Kong, and Guangxi are the provinces where our mod-els were not able to reduce the error compared to the baseline. While Augmented ARGONet’sperformance in these provinces was not superior to the baseline, its predictions were within areasonable range as seen in Figures 1 and 2. We suspect that the reason why performance in Tai-wan, Hong Kong, and Guangxi did not improve is due to the different administrative (and likelyhealthcare) systems of these provinces. We were not able to perform any analysis on Tibet, oneof the largest province in China, and Macau given their low count of detected COVID-19 cases.

Analysis of the importance of the sources of information over time.

To minimize the predic-tion errors in our estimates, the dynamic design of our methodology utilizes different sourcesof information as needed over time. This means that, for each province (or group of provinceswithin a cluster), we can quantify the predictive power of different features used in our modelsas time evolves. Our analysis, visualized in Figure 4, shows that historical COVID-19 con-ﬁrmed cases, and suspected cases were consistently relevant sources of information over mostof the study period. Internet-based search terms from Baidu were also frequently used. Dailynews counts were used by our models in a selected number of provinces. However, for manyof these provinces, the importance of media article counts decrease over time. Estimates frommechanistic models contributed to our model prediction, especially in early February, 2020.5 N u m b e r o f C a s e s Anhui

Shanghai

Sichuan

Fujian

COVID-19 Confirmed Cases Baseline Augmented ARGONet Training Period0100

Jiangxi

Guangdong

Qinghai

Henan N u m b e r o f C a s e s Yunnan

Xinjiang

Gansu

Shaanxi

Guizhou

Zhejiang

Jilin

Jiangsu N u m b e r o f C a s e s Tianjin

Hebei

Shandong

Hainan

Inner Mongolia

Ningxia

Beijing

Heilongjiang N u m b e r o f C a s e s Hunan

Hubei

Chongqing

Guangxi - -

18 2020 - -

03 2020 - - Prediction Date

Hong Kong - -

18 2020 - -

03 2020 - - Liaoning - -

18 2020 - -

03 2020 - - Prediction Date

Shanxi - -

18 2020 - -

03 2020 - - Taiwan

Figure 1: Graphical visualization of the number of new conﬁrmed cases for COVID-19, asreported by China CDC, along with Augmented ARGONet (solid red) 2-day ahead of timeestimates between February 3 2020, through February 21 2020. As a comparison, the dottedblue line represents the persistence model. 6 nhu i S h a n g h a i S i c hu a n F u ji a n J i a n g x i G u a n gd o n g Q i n g h a i H e n a n Y unn a n X i n ji a n g G a n s u S h aa n x i G u i z h o u Z h e ji a n g J ili n J i a n g s u T i a n ji n H e b e i S h a n d o n g H a i n a n I nn e r M o n g o li a N i n g x i a B e iji n g H e il o n g ji a n g H un a n H u b e i C h o n gq i n g G u a n g x i H o n g K o n g L i a o n i n g S h a n x i T a i w a n E rr o r I m p r o v e m e n t ( R M S E B A S E L I N E R M S E M O D E L ) Augmented ARGONet ARGONet AR Q i n g h a i G a n s u T a i w a n X i n ji a n g H e b e i N i n g x i a H u b e i H o n g K o n g J ili n H e il o n g ji a n g T i a n ji n A nhu i J i a n g x i S h a n g h a i Y unn a n H a i n a n H e n a n G u i z h o u S i c hu a n S h a n x i F u ji a n J i a n g s u S h aa n x i G u a n g x i G u a n gd o n g H un a n Z h e ji a n g B e iji n g S h a n d o n g L i a o n i n g C h o n gq i n g I nn e r M o n g o li a C o rr e l a t i o n I m p r o v e m e n t | M o d e l B a s e li n e | Augmented ARGONet ARGONet AR

Figure 2: Comparison of the improvement in terms of RMSE (top) and Pearson Correlation(bottom) for each model used in the study. To facilitate comparison between model scores ineach province in terms of RMSE, we normalized the RMSE score of each model by the base-line’s RMSE and visualized its inverse value. In this way, scores above 1 imply an improvement(RMSE reduction), whereas score below 1 imply the model had a bigger RMSE in comparisonto the baseline. In the case of correlation, we plotted the difference between the absolute valuesbetween each model’s correlation and the baseline. Each panel is ordered, from left to right,based on the metric performance of Augmented ARGONet (solid red).7igure 3: Geographical visualization of the relative improvement (

RMSE model

RMSE baseline ). Chineseprovinces that show an increase in performance relative to the baseline are shaded green, whileprovinces that did not perform better than our baseline are shaded purple. Provinces with thehighest improvement (Anhui, Shanghai, Sichuan, Fujian, Jiangxi, Guangdong, Qinghai) andunderperformance (Taiwan, Shanxi, Liaoning, Hong Kong, and Guangxi) are identiﬁed by ared dot over the province. 8 nterceptConfirmed Cases (t-1)Confirmed Cases (t-3)Cumulative confirmed cases (t)Media Cloud Count (t)Baidu Term "how many degree is fever" (t)Mechanistic Model Estimate (t)

Anhui Shanghai Sichuan Fujian Jiangxi Guangdong Qinghai

Confirmed Cases (t)Confirmed Cases (t-2)New suspected cases (t)Cumulative suspected cases (t)Baidu Term "coronavirus symptom" (t)Baidu Term "symptom of fever" (t)

Henan Yunnan Xinjiang Gansu Shaanxi Guizhou Zhejiang

InterceptConfirmed Cases (t-1)Confirmed Cases (t-3)Cumulative confirmed cases (t)Media Cloud Count (t)Baidu Term "how many degree is fever" (t)Mechanistic Model Estimate (t)

Jilin Jiangsu Tianjin Hebei Shandong Hainan Inner Mongolia

Confirmed Cases (t)Confirmed Cases (t-2)New suspected cases (t)Cumulative suspected cases (t)Baidu Term "coronavirus symptom" (t)Baidu Term "symptom of fever" (t)

Ningxia Beijing Heilongjiang Hunan - -

03 2020 - -

07 2020 - -

11 2020 - -

15 2020 - - Hubei - -

03 2020 - -

07 2020 - -

11 2020 - -

15 2020 - - Chongqing - -

03 2020 - -

07 2020 - -

11 2020 - -

15 2020 - - Guangxi - -

03 2020 - -

07 2020 - -

11 2020 - -

15 2020 - - InterceptConfirmed Cases (t-1)Confirmed Cases (t-3)Cumulative confirmed cases (t)Media Cloud Count (t)Baidu Term "how many degree is fever" (t)Mechanistic Model Estimate (t)

Hong Kong - -

03 2020 - -

07 2020 - -

11 2020 - -

15 2020 - - Liaoning - -

03 2020 - -

07 2020 - -

11 2020 - -

15 2020 - - Shanxi - -

03 2020 - -

07 2020 - -

11 2020 - -

15 2020 - - Taiwan

Figure 4: Time evolution of the value (averaged over the 20 experiments) of the linear co-efﬁcients for the features used in our methodology, visualized per province. Every heatmapincludes the same number of features (rows) and is organized in the same order.9 iscussion

We presented a methodology capable of producing meaningful and reliable short-term (2-dayahead) forecasts of COVID-19 activity, at the province level in China, by combining informa-tion from reports from China CDC, Internet search trends, news article trends, and informationfrom mechanistic models. Our approach is capable of overcoming multiple challenges charac-teristic of emerging outbreaks caused by novel pathogens. These challenges include: the lack ofhistorical disease activity information to calibrate models, the low volume of case count data,and the inherent delay to gain access to data. Methodologically speaking, our method maxi-mizes the use of limited number of observations as the outbreak unfolded by (a) choosing anappropriate aggregation time-window (2 days) to improve the signal to noise ratio , (b) lever-aging synchronicities in the spatio-temporal trends in COVID-19, across provinces, to producecluster-speciﬁc models of prediction, and (c) using data augmentation methods to increase thestability in the training of our models.Previous methods, such as the ARGONet model (

11, 21 ), have been shown to make accuratereal-time prediction, at the state-level in the USA for seasonal infectious diseases such as In-ﬂuenza. In addition, Chinazzi et al. ( ), showed that it was possible to estimate the evolution ofan emerging outbreak using a mechanistic model. Nevertheless, as far as we know, reliable real-time methodologies to forecast new case counts for an emerging disease outbreak remained anunsolved problem. In this study, we showed that dynamically trained machine learning modelcan produce accurately real-time estimates for COVID-19 outbreak.In terms of prediction error, our proposed methodology, Augmented ARGONet, was able tooutperform the persistence model in 27 out of 32 provinces. While our method does not showprediction error improvements in Guangxi, Liaoning, Shanxi, Taiwan, and Hong Kong, ourforecasts are still within range in all provinces except for Taiwan, where very few cases werereported during the time-period of this study. It is important to note that Taiwan , Hong Kongand Guangxi have different administrative (and likely healthcare) systems compared with otherprovinces. This, could explain the differences in COVID-19 trends in these regions and couldhelp explain why our models do not seem to add value to the persistence model. Features stud-ies should investigate if incorporating disease activity estimates from other mechanistic models,designed with likely different assumptions and mathematical formulations, could lead to furtherimprovements.We were unable to identify an accurate (daily) parametrization of changes in human mobilitydue to the widespread local lockdowns during the period of our study (February 3-21, 2020),and thus, we did not include this data source as a potential predictor. Future studies may in-corporate (high temporal resolution) human mobility data as a modulator of transmission andpredictor of disease activity. When looking at the entire time-period of this study; however,we observed that the data-driven clustering of provinces used in our approach and based on10OVID-19 activity, appears to have similarities with the clustering one would obtain usinghuman mobility data made available by Baidu ( ) (Figure S1). This result aligns with the con-clusion of other available studies that ﬁnd that the time evolution of the COVID-19 outbreakin China, was importantly inﬂuenced by changes in human mobility (consequence of publichealth interventions) (

16, 17, 23 ), and in the early stages, associated to the percentage of peopletraveling from Wuhan.Our ﬁndings suggest that, it is possible to use very limited amounts of data from multiple datasources to conduct real time forecasting in the early stage of an emerging outbreak. We believethat our method, Augmented ARGONet, could prove to be useful for public health ofﬁcialsto monitor (and perhaps prevent) the spread of the virus (

8, 11, 24, 25 ). As the SARS-CoV-2virus continues to spread around the world, extensions of our methods could be implemented toprovide timely and reliable disease activity estimates to decision makers.

Materials and Methods

Data sources.

We used the following data sources for our study: COVID-19 activity reportsfrom China Centers for Disease Control and Prevention (CDC), Internet search frequenciesfrom Baidu, the number of related news reports from 311 media sources, as reported by theMedia Cloud platform, and COVID-19 daily forecasts from the Global Epidemic and MobilityModel. We provide details about each of these data sources in the next paragraphs.

Daily reports of COVID-19.

Case counts of COVID-19 were obtained from China CDC .This data is curated and publicly available via the Models of Infectious Disease Agent Study(MIDAS) association (github.com/midas-network/COVID-19/tree/master/data/cases). All datawere collected on the original date they became available. Indeed, case counts released byChina CDC can be revised, up to several weeks later. In this study, we only used unreviseddata, which is the real case scenario to produce real time estimates. The reports, available forall the provinces, include various activity trends such as: new diagnosed cases, new suspectedcases, and new reported deaths. For our study, we selected the number of conﬁrmed cases asthe epidemiological target, and collected activity reports from January 10, 2020 to February 21,2020.

Baidu Internet search activity.

We collected the daily search fraction for three differentCOVID-19-related search terms in Mandarin (translated in English they are ”COVID-19 symp-toms”, ”how many degrees” and ”symptoms of fever”). These terms were selected based ontheir correlation and potential association with case counts of COVID-19 ( ), and collectedindividually for each province from January 1, 2020 to February 21, 2020. Our decision to useInternet activity as a source of information is based on the hypothesis that search frequenciesfrom COVID-19-related keywords reﬂect, to an extent, the number of people presenting symp-toms related to COVID-19 before their arrival to a clinic. Given Baidu imposes limits to the data11ccess for researchers, we were unable to conduct a broad analysis on a wide range keywords.A visualization of the Baidu search term timeseries can be seen in S3. News reports . An online open-source platform called Media Cloud, that allows the track-ing and analysis of media for any topic of interest through the matching of keywords, was used.We obtained volumes of the number of news articles available over time from a collection of311 Chinese media websites using the keywords “coronavirus” ,“COVID-19”, “2019-nCoV”,“pneumonia”,“fever”,“cough” and the name of each province to generate province-speciﬁcnews activity trends. Media data from January 1, 2020 to February 21, 2020 were collectedand used as an additional source information.

Aggregation of daily reports . To enhance signal and reduce noise, we aggregated case count,searches volume and media articles count for each δt = Clustering.

We clustered the 32 provinces into several groups and train a model for eachgroup. Clustering and model retraining processes were repeated on every single new predictiondate. To know the similarity of outbreak pattern among Chinese provinces, we calculated thepairwise correlation matrix for conﬁrmed COVID-19 cases by using all historical data available.Then, based on similarity matrix, provinces were clustered by using complete linkage hierar-chical clustering, which is an agglomerative hierarchical clustering method, creating clustersbased on most dissimilar pairs. ( ). The number of clusters K was determined by choosing theK, maximizing Calinski-Harabasz index ( ). Data Augmentation.

We conducted data augmentation by using bootstrap method to re-sampleeach data point of the training dataset. We made 100 bootstrap samples for each data point towhich we added a random Gaussian noise with a mean of 0 and a standard deviation of 0.01.Due to stochasticity of both clustering algorithm and model training processing, on each pre-diction day, we run the whole clustering-training process twenty times and take an average ofthe outputs as our ﬁnal prediction.

Predictive model . For our prediction task, we ﬁtted a LASSO multi-variable regularized lin-ear model for every data set generated from our clustering and augmentation steps at time t .The LASSO technique minimizes the mean squared error between observations and predictionssubject to a L1 norm constraint. The number of new conﬁrmed COVID-19 cases for the next12i-day can be then expressed as: y T + δt = (cid:88) i =0 α i y T − iδt + βS T + γM T + δD T + ψC T + (cid:15) T + δt (1)where • y T + δt is the estimate at date T + δt ,where δt = 2 days • y T is the number of cases at date T . • S T is the searches volume at date T . • M T is the number of media article at date T . • D T is the number of deaths at date T . • C T is the number of cumulative cases at date T . • (cid:15) T + δt is the normally distributed error term.Model were dynamically recalibrated, similar to the method presented in Santillana et al. 2014( ) and Lu et al. 2019 ( ). Our method was implemented in R 3.5.3 environment with glmnet3.0-2 library. Global Epidemic and Mobility Model (GLEAM)

GLEAM, the mechanistic model used inthis study, is an individual-based, stochastic, and spatial epidemic model (

16, 30–32 ). GLEAMuses a meta-population network approach which was integrated with real-world data. In GLEAM,the world population is divided into sub-populations centered around major transportation hubs(usually airports). The sub-populations are connected by the ﬂow of individuals traveling dailyamong them. Over 3,000 sub-populations in about 200 different countries and territories areincluded in the model. The airline transportation data consider daily origin-destination trafﬁcﬂows obtained from the Ofﬁcial Aviation Guide (OAG) and IATA databases (updated in 2019).Ground mobility ﬂows are derived by the analysis and modeling of data collected from thestatistics ofﬁces for 30 countries on 5 continents ( ). Mobility variations in Mainland Chinawere adapted from information from Baidu Location-Based Services (LBS). Within each sub-population, the human-to-human transmission of COVID-19 is modeled using a compartmentalrepresentation of the disease where each individual can occupy one of the following four states:Susceptible ( S ), Latent ( L ), Infectious ( I ) and Removed ( R ). Susceptible individuals can ac-quire the virus through contacts with individuals in the infectious state, and become latent,meaning they are infected but can not transmit the infection yet. Latent individuals progress tothe infectious stage with a rate inversely proportional to the latent period. Infectious individualsprogress into the removed stage with a rate inversely proportional to the infectious period. The13um of the average latent and infectious periods deﬁnes the generation time. Removed individ-uals represent those who can no longer infect others, meaning they were isolated, hospitalized,died, or have recovered.The model produces an ensemble of possible epidemic scenarios described by the number ofnewly generated infections, times of disease arrival in each subpopulation, and the number oftraveling infection carriers. We make an assumption that a starting date of the epidemic thatfalls between November, 15 2019 and December 1, 2019, with 40 cases IS caused by zoonoticexposure ( ). The transmission dynamic is calibrated by using an Approximate BayesianComputation approach to estimate the posterior distribution of the basic reproductive number R ( ). We assume that the world has a detection of imported cases as low as 40% (

38, 39 ).Data on importation of cases were derived from currently available published line lists (

40, 41 ). Performance of model and relevance of predictors.

Two different metrics were used tomeasure the performance of our predictive model : 1) The root mean square error (RMSE),2) The Pearson correlation. To assess the predictive power of our methodology, we comparedour performance against the following models:1. Persistence rule (Baseline): A rule-based model that uses the new case count at date T asestimate of prediction for T + δt so that ( y T + δt = y T )2. Autoregressive (AR): A simple autoregressive model built on COVID-19 cases that oc-curred in the previous 3 autoregressive lags (2-day reports). Please refer to SupplementaryMaterials for more information on this model.3. ARGONet : An alternate version of our methodology that does not include any mecha-nistic information.As linear models are used in this study, relevance of predictors in predicting new cases can bedeﬁned thanks to the associated factor of each term in the trained model. As all data were nor-malized using the z-score (strictly within the training datasets) during training and prediction,the associated factor can be approximately understood as, how many standard deviations thepredicted new cases y T + δt will change if 1 standard deviation change in the predictor. Acknowledgments

We thank Dr. Wei Luo for his assistance and guidance on the interpretation of mobility data forChinese provinces. MC and AV acknowledge support from Google Cloud Healthcare and LifeSciences Solutions via GCP research credits program.14 unding statement

CP, AV, and MS were partially supported by the National Institute of General Medical Sciencesof the National Institutes of Health under Award Number R01GM130668. The content is solelythe responsibility of the authors and does not necessarily represent the ofﬁcial views of theNational Institutes of Health.

Author contributions

DL, LC, CP, AV, and MS conceived and designed the study. DL, LC, CP, XD, MC collected thedifferent data sources. MC, JTD, and AV, produced predictions using the GLEAM modelingplatform. DL, LC, CP, implemented the Augmented ARGONet methodology. DL, LC, CP,and MS analyzed the results. DL, LC, CP, and MS wrote the ﬁrst draft of the manuscript. Allauthors contributed to and approved the ﬁnal version of the manuscript.

Competing interests

The authors have declared no competing interest.

Data sharing

All codes and data will be made available via the Harvard dataverse.

References and Notes

1. Li, Q. et al.

Early transmission dynamics in wuhan, china, of novel coronavirus–infectedpneumonia.

New England Journal of Medicine (2020).2. Zhu, N. et al.

A novel coronavirus from patients with pneumonia in china, 2019.

NewEngland Journal of Medicine (2020).3. Chan, J. F.-W. et al.

A familial cluster of pneumonia associated with the 2019 novel coro-navirus indicating person-to-person transmission: a study of a family cluster.

The Lancet (2020).4. Gilbert, M. et al.

Preparedness and vulnerability of african countries against introductionsof 2019-ncov. medRxiv (2020). 15. WHO. Statement on the second meeting of the international health regulations (2005)emergency committee regarding the outbreak of novel coronavirus (2019-ncov). URL . [Online; accessed 18-February-2020].6. Du, Z. et al.

Risk of 2019 novel coronavirus importations throughout china prior to thewuhan quarantine. medRxiv (2020).7. Sun, K., Chen, J. & Viboud, C. Early epidemiological analysis of the 2019-ncov outbreakbased on a crowdsourced data. medRxiv (2020).8. Wang, C., Horby, P. W., Hayden, F. G. & Gao, G. F. A novel coronavirus outbreak of globalhealth concern.

The Lancet (2020).9. Yang, S., Santillana, M. & Kou, S. C. Accurate estimation of inﬂuenza epidemics usinggoogle search data via argo.

Proceedings of the National Academy of Sciences , 14473–14478 (2015).10. Santillana, M. et al.

Combining search, social media, and traditional data sources to im-prove inﬂuenza surveillance.

PLoS computational biology (2015).11. Lu, F. S., Hattab, M. W., Clemente, C. L., Biggerstaff, M. & Santillana, M. Improved state-level inﬂuenza nowcasting in the united states leveraging internet-based data and networkapproaches. Nature communications , 1–10 (2019).12. Cleaton, J. M., Viboud, C., Simonsen, L., Hurtado, A. M. & Chowell, G. Characterizingebola transmission patterns based on internet news reports. Clinical Infectious Diseases ,24–31 (2016).13. Li, C. et al. Retrospective analysis of the possibility of predicting the covid-19 outbreakfrom internet searches and social media data, china, 2020.

Eurosurveillance , 2000199(2020).14. Lipsitch, M. & Santillana, M. Enhancing situational awareness to prevent infectious diseaseoutbreaks from becoming catastrophic. Curr. Top. Microbiol. Immunol. , 59–74.15. Li, R. et al.

Substantial undocumented infection facilitates the rapid dissemination of novelcoronavirus (sars-cov2).

Science (2020).16. Chinazzi, M. et al.

The effect of travel restrictions on the spread of the 2019 novel coron-avirus (covid-19) outbreak.

Science (2020).17. Lai, S. et al.

Effect of non-pharmaceutical interventions for containing the covid-19 out-break: an observational and modelling study. medRxiv (2020).168. Jung, S.-m. et al.

Real-time estimation of the risk of death from novel coronavirus (covid-19) infection: Inference using exported cases.

Journal of Clinical Medicine , 523 (2020).19. Aiken, E. L. et al. Real-time estimation of disease activity in emerging outbreaks usinginternet search information. medRxiv

PLoS neglected tropical diseases ,e0005295 (2017).21. Poirier, C. et al. Inﬂuenza forecasting for the french regions by using ehr, web and climaticdata sources with an ensemble approach argonet. medRxiv et al.

The effect of human mobility and control measures on the covid-19 epidemic in china.

Science (2020).24. Wu, J. T., Leung, K. & Leung, G. M. Nowcasting and forecasting the potential domestic andinternational spread of the 2019-ncov outbreak originating in wuhan, china: a modellingstudy.

The Lancet (2020).25. Phan, L. T. et al.

Importation and human-to-human transmission of a novel coronavirus invietnam.

New England Journal of Medicine (2020).26. Xiaoxuan, L., Qi, W. & Benfu, L. Can search query forecast successfully in china’s 2019-ncov pneumonia? medRxiv (2020).27. Defays, D. An efﬁcient algorithm for a complete link method.

The Computer Journal ,364–366 (1977).28. Cali´nski, T. & Harabasz, J. A dendrite method for cluster analysis. Communications inStatistics-theory and Methods , 1–27 (1974).29. Santillana, M., Nsoesie, E. O., Mekaru, S. R., Scales, D. & Brownstein, J. S. Using clin-icians search query data to monitor inﬂuenza epidemics. Clinical infectious diseases: anofﬁcial publication of the Infectious Diseases Society of America , 1446 (2014).30. Balcan, D. et al. Modeling the spatial spread of infectious diseases: The global epidemicand mobility computational model.

Journal of computational science , 132–145 (2010).31. Gomes, M. F. et al. Assessing the international spreading risk associated with the 2014West African Ebola outbreak.

PLoS currents (2014).172. Zhang, Q. et al. Spread of Zika virus in the Americas.

Proceedings of the National Academyof Sciences , E4334–E4343 (2017).33. Rambaut, A. Phylogenetic analysis of ncov-2019 genomes (2020). [Online; accessed 05-February-2020].34. Report 3: Transmissibility of 2019-ncov (2020). URL . [Online; accessed 05-February-2020].35. Andersen, K. Estimates of the clock and TMRCA for 2019-nCoV based on 27 genomes(2020). [Online; accessed 05-February-2020].36. Bedford T, Neher R, Hadﬁeld J, Hodcroft E, Ilcisin M, Mller N. Genomic analysis ofnCoV spread. Situation report 2020-01-23 (2020). URL https://nextstrain.org/narratives/ncov/sit-rep/2020-01-23 . [Online; accessed 05-February-2020].37. Sunnker, M. et al.

Approximate bayesian computation.

PLOS Computational Biology (2013).38. Niehus, R., De Salazar, P. M., Taylor, A. & Lipsitch, M. Quantifying bias of covid-19prevalence and severity estimates in wuhan, china that depend on reported cases in interna-tional travelers. medRxiv (2020).39. De Salazar, P. M., Niehus, R., Taylor, A., Buckee, C. O. & Lipsitch, M. Using predictedimports of 2019-ncov cases to determine locations that may not be identifying all importedcases. medRxiv (2020).40. Sun, K., Chen, J. & Viboud, C. Early epidemiological analysis of the coronavirus disease2019 outbreak based on crowdsourced data: a population-level observational study. TheLancet Digital Health (2020).41. Pinotti, F. et al.

Lessons learnt from 288 covid-19 international cases: importations overtime, effect of interventions, underdetection of imported cases. medRxiv (2020).

Supplementary

Mobility data correlation

Human mobility data . As an additional source, we collected the daily human mobility (interms of percentage) between the Chinese provinces via Baidu. We decided to not include hu-man mobility information as a source of information for our methodology given the absence ofdata during February. 18o investigate the impact of human mobility on new conﬁrmed cases, we performed a simi-lar analysis on Baidu mobility data. Baidu mobility data was obtained from a public avail-able archive (https://doi.org/10.18130/V3/YQLJ5W) ( ), which was originallyscraped from (qiangxi.baidu.com) website. The dataset contains daily inbound (i.e.percentage of people traveling to the city from all the cities in China) and outbound (i.e. per-centage of people traveling from the city to all the cities in China) mobility data for all cities(except Hong Kong, Macau and Taiwan) in China on each day from January 1, 2020 to January31, 2020.For this study, we were particularly interested in the relationship between the amount of peopletraveling from Wuhan/Hubei to other provinces and the number of new conﬁrmed cases in thoseprovinces. Given the original data from Baidu is city-level, we ﬁrst extracted all the outbounddata from Wuhan (which is the capital of Hubei and the ﬁrst city with conﬁrmed COVID-19cases) and mapped all the cities belonging to the same province to get province-level outboundmobility from Wuhan. We obtained a 31x31 matrix, where each row is a province in China, andeach column is one day in January, each element shows the percentage of people traveling outof Wuhan to the other provinces, among all people traveling from Wuhan on that day.To understand the similarity of the human mobility from Wuhan among Chinese provincesand investigate the potential impact of mobility on the conﬁrmed cases of each province, wecalculated the pairwise correlation matrix for the percentage of people traveling from Wuhanto each province by using the data from January 1, 2020 to January 31, 2020 (see Figure S1).If the pairwise correlation between the two provinces is high, it indicates that the trend in thepercentage of people traveling from Wuhan to the two provinces are similar. We calculated thepairwise correlation matrix for new conﬁrmed cases using data from February 1, 2020 to Febru-ary 21, 2020, and we found that provinces which are geographically closed are not necessarilyclustered together, indicating geographical distance would not be able to explain the similarityin the new conﬁrmed cases. To further explore the potential impact of mobility on the new con-ﬁrmed cases of each province, we then calculated the Pearson correlation and cosinus similaritybetween the two correlation matrices. To do this, we ﬁrst converted the two matrices into twovectors and calculated the correlation between the two vectors. We got Pearson correlation of0.1473 and cosinus similarity of 0.4729. 19igure S1: Visualization of the pairwise correlation matrix of human mobility from Wuhan toeach Chinese province. During the period of January, we can see a similar trend of mobility fora big cluster of provinces (the mobility matrix has been rearranged in the same province orderthan the COVID-19 case counts correlation matrix to facilitate visualization of the similarity).20 E rr o r ( N u m b e r o f C a s e s ) Anhui

Shanghai

Sichuan

Fujian

COVID-19 Confirmed Cases Baseline Augmented ARGONet ARGONet500

Jiangxi

Guangdong Qinghai

Henan E rr o r ( N u m b e r o f C a s e s ) Yunnan

Xinjiang

Gansu

Shaanxi

Guizhou

Zhejiang

Jilin

Jiangsu E rr o r ( N u m b e r o f C a s e s ) Tianjin

Hebei

Shandong

Hainan

Inner Mongolia

Ningxia

Beijing

Heilongjiang E rr o r ( N u m b e r o f C a s e s ) Hunan

Hubei

Chongqing

Guangxi - -

03 2020 - - Prediction Date

Hong Kong - -

03 2020 - - Liaoning - -

03 2020 - - Prediction Date

Shanxi - -

03 2020 - - Taiwan

Figure S2: Graphical visualization of the out-of-sample COVID-19 error ( ˆ y − y ) between Febru-ary 03, 2020, to February 21, 2020. 21 utoregressive Model An autoregressive model (AR) was built using only COVID-19 local (in contrast to AugmentedARGONet’s clustering methodology) historical conﬁrmed case counts from the past three 2-dayreports. Based on this formulation, our 2-day forecast ( ˆ y t +2 ) of COVID-19 conﬁrmed cases isestimated as follows: y T + δt = (cid:80) i =0 α i y T − iδt + (cid:15) T + δt (2)where • y T + δt is the estimate at date T + δt ,where δt = 2 days • y T is the number of cases at date T . • (cid:15) t ∼ N (0 , σ ) , residualsAs for Augmented ARGONet, our version of AR was dynamically trained.22 .00.51.0 Anhui Beijing Chongqing Fujian Gansu provinceBaidu search ("coronavirus symptoms")Baidu search ("fever how many degrees") Official Province LockdownConfirmed Cases (normalized)0.00.51.0 Guangdong Guangxi Zhuang Guizhou Hainan Hebei0.00.51.0 Heilongjiang Henan Hong Kong Hubei Hunan0.00.51.0 Inner Mongolia Jiangsu Jiangxi Jilin Liaoning0.00.51.0 Macau Ningxia Qinghai Shaanxi Shandong0.00.51.0 Shanghai Shanxi Sichuan Taiwan - -

05 2020 - - Tianjin - -

05 2020 - - - -

05 2020 - - Xinjiang - -

05 2020 - - Yunnan - -