The Interplay of Demographic Variables and Social Distancing Scores in Deep Prediction of U.S. COVID-19 Cases
Francesca Tang∗†, Yang Feng‡, Hamza Chiheb§, Jianqing Fan¶

January 7, 2021
Abstract
With the severity of the COVID-19 outbreak, we characterize the nature of the growth trajectories of counties in the United States using a novel combination of spectral clustering and the correlation matrix. As the U.S. and the rest of the world are experiencing a severe second wave of infections, the importance of assigning growth membership to counties and understanding the determinants of the growth is increasingly evident. Subsequently, we select the demographic features that are most statistically significant in distinguishing the communities. Lastly, we effectively predict the future growth of a given county with an LSTM using three social distancing scores. This comprehensive study captures the nature of counties' growth in cases at a very micro-level, using growth communities, demographic factors, and social distancing performance, to help government agencies utilize known information to make appropriate decisions regarding which potential counties to target resources and funding to.

∗ Tang and Feng contributed equally to this work.
† Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ ([email protected]).
‡ Department of Biostatistics, New York University, New York City, NY ([email protected]).
§ New York City, NY ([email protected]).
¶ Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ ([email protected]).

Keywords:
COVID-19, Stochastic Block Model, Spectral Clustering, Community Detection, Machine Learning, Neural Network.
Introduction

The recent infectious disease (COVID-19) caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has overtaken the world as the largest pandemic we have seen in decades. The World Health Organization (WHO) labeled it a pandemic on 03/11/2020, with a total of more than 85 million cases and more than 1.84 million deaths worldwide as of 01/04/2021.

Forecasting the growth of confirmed cases and the locations of future outbreaks has been a persistent challenge in the public health and statistical fields. With the gravity and urgency of the global health crisis, many recent works, including Kucharski et al. (2020) and Peng et al. (2020), have attempted to model the growth in cases in various countries. Most of the literature on statistical modeling of the data focuses on the reproduction number. However, this value is constantly evolving and is not always a valuable measurement to build prediction models with. Hong and Li (2020) proposed a Poisson model with time-dependent transmission and removal rates to estimate a time-dependent disease reproduction number. Betensky and Feng (2020) studied the impact of incomplete testing on the estimation of dynamic doubling time. Ultimately, we need to examine the underlying features contained in the time series data in order to extract valuable insights into the unique nature of the spread of COVID-19. As the number of deaths is at least a two-week lagging indicator compared to the number of confirmed cases, we only look at the latter. More importantly, the matrix of the number of deaths per county would be very sparse at the initial stage, making any analysis more difficult. Our goal is to characterize and project the disease progression given the limitations of public data from a theoretical yet practical perspective.

Stochastic block models (SBMs), first developed by Holland et al. (1983), have long been studied as a powerful statistical tool in community detection, where the nodes or members are partitioned into latent groups.
SBMs have been employed to study social networks (Wasserman and Anderson, 1987), brain connectivity (Rajapakse et al., 2017), protein signaling networks (Chen and Yuan, 2006), and many others. Under an SBM, the nodes within the same group usually have a higher probability of being connected than those from different groups. The difficult task is to recover these connectivities and the communities based on one observation, which in our case is a snapshot of the changes in the number of cases up to the most recent time point. In more recent years, spectral clustering (Balakrishnan et al., 2011; Rohe et al., 2011; Jin, 2015a; Lei and Rinaldo, 2015) has arisen as one of the most popular and widely studied approaches to recover these communities. Conventional spectral clustering algorithms mostly involve two steps: eigen-decompose the adjacency or Laplacian matrix of the data and then apply a clustering algorithm, such as k-means, to the eigenvectors that correspond to the largest or smallest eigenvalues. There is extensive literature on such procedures, for instance von Luxburg (2007), Ng et al. (2001), and Abbe (2017).

In this study, we introduce the unique procedure of conducting spectral clustering on the sample Pearson correlation coefficient matrix directly and compare its clusters to the standard Laplacian embedding. This complements Brownlees et al. (2020+)'s approach based on a latent covariance model for financial return data. Gilbert et al. (2020) used agglomerative clustering, an unsupervised learning method, on preparedness and vulnerability data in African countries using self-reported reports of capacity and indicators. While a comprehensive study, it only considers the possible exposures to travelers from China. Using a different dataset, Hu et al. (2020) clustered the data from China by implementing a simple k-means clustering directly on various features of the provinces/cities and not on the eigenvectors of the correlation matrix.
It also does not take into account possible explanatory features that are not directly related to the number of cases, and it fails to predict provinces that have yet to have cases. The data processing of some existing approaches also does not standardize and shift the data in a way that aligns with the nature of COVID-19.

Once the communities are found, the subsequent part uncovers the statistically significant demographic features, pre-existing in the counties, that could largely explain a county's community membership. Most of the existing research on salient demographic information focuses on age-related features and the presence of co-morbidities or underlying health conditions, e.g., Dowd et al. (2020) and Lippi et al. (2020). In reality, what influences how the disease progresses in a county is most likely a confluence of variables, not one or two prevailing ones. Some studies also examine how various demographic determinants affect how well a county carries out social distancing (Im et al., 2020), but offer little or no connection to the nature of the growth curve.

The extracted variables from the feature analysis part are then used in conjunction with time series of social distancing scores from Unacast (2020) to fit a recurrent neural network, ultimately predicting the progression of confirmed cases in a given county. It is important to note that for this prediction section, we use the period from the start of the pandemic until 07/20/2020, as this traces the first large spike in cases in the U.S. and a subsequent plateau. This gives a long enough time series sample; including much more recent data would capture the second large wave of the pandemic, which is counter to the objective of capturing the growth trajectory of a county's peak and fall. Unacast has created a scoreboard of social distancing measures with mobile device tracking data, where a device is assigned to a specific county based on the location the device spent the most amount of time in.
The neural network prediction takes these static, inherent county variables, community membership (the clustering results), and social distancing data to predict the future growth of confirmed cases. There have been several studies that predict, estimate, or model the growth curve of the disease, including Fanelli and Piazza (2020) on the cases in Italy, Roda et al. (2020) on the cases in Wuhan, and Liu et al. (2020) on the cases in China. In addition, deep learning has been applied to COVID-19 research, such as Wang and Wong (2020), which detects positive cases through chest scans. Other studies such as Zheng et al. (2020b) investigate when patients are most infectious by using a deep learning hybrid model, and Yang et al. (2020) similarly combines the epidemiological SIR model with an LSTM network. However, the literature on COVID-19 still lacks any comprehensive approach on a county level that creates a throughline of the pandemic: the historical growth curve of confirmed cases, characterization of this growth via clustering, the significant explanatory demographic features, and finally, social distancing measures that give insight into the nature of the future growth trajectory, as displayed in Figure 1. Table 1 also contains the specific time period, the number of counties N, and the data source used for each part of the paper (Part I: community detection, Part II: extraction of significant features, Part III: prediction) as outlined in Figure 1.

Figure 1: Pipeline of this study's three-part analysis of COVID-19. COVID-19 time series data is first used to perform community detection, clustering counties into several communities. Then, demographic features are incorporated to extract the most significant features that distinguish the growth communities. Finally, social distancing metric time series are added to the results of the previous two parts to carry out the prediction of COVID-19 cases for new counties.

           Time Period(s)                           No. of counties N   Source
Part I     1/22/20 - 4/17/20 and 5/10/20 - 7/10/20  950                 Johns Hopkins CSSE
Part II    1/22/20 - 4/17/20 and 5/10/20 - 7/10/20  633                 ACS (c/o Data Planet)
Part III   2/25/20 - 7/10/20                        627                 Unacast

Table 1: Time period, number of counties, and data source used for each part of the paper.
Community Detection

The first part of this paper finds potential communities among the counties, in which clusters share similar growth patterns. To accomplish this, two fundamental concepts are necessary: the stochastic block model and spectral clustering. The former is a generative model through which community memberships are formed, and the latter is a methodology often utilized to recover these memberships. Compared to traditional clustering methods, spectral clustering has been shown to be effective in both statistical and computational efficiency (Abbe, 2017; Abbe et al., 2020). Our approach applies spectral clustering to the correlation matrix, instead of the commonly used adjacency matrix or Laplacian matrix. The goal is to recover the county membership matrix embedded in the correlations of each county's logarithmic daily cumulative number of cases.
Correlation matrix vs. adjacency matrix
For each county, consider a daily time series of the cumulative number of confirmed cases, where we use curve registration (the time origin is set as the day on which the number of cases exceeds 12 for a particular county). This curve registration is important as it takes into account the fact that counties may have different COVID-19 outbreak starting times. We denote $w_{i,t} = \log(x_{i,t})$ as the logarithmic cumulative number of cases of county $i$ on the $t$-th day since the county hit 12 or more cases. Then, we use the Pearson correlation as a similarity measure, defined as

$$R_{ij} = \frac{\sum_{t=1}^{T_{ij}} (w_{i,t} - \bar{w}_i)(w_{j,t} - \bar{w}_j)}{\sqrt{\sum_{t=1}^{T_{ij}} (w_{i,t} - \bar{w}_i)^2}\,\sqrt{\sum_{t=1}^{T_{ij}} (w_{j,t} - \bar{w}_j)^2}}, \qquad (1)$$

where $T_{ij} = \min(T_i, T_j)$, with $T_i$ and $T_j$ being the number of days county $i$ and county $j$ have had 12 or more cases, respectively. The sample correlation matrix $R \in \mathbb{R}^{n \times n}$ then contains the pairwise correlations among all $n$ counties. The logarithmic cumulative case counts are used to align with the exponential growth pattern implied by popular epidemic models. For example, we could distinguish between a faster exponential growth function such as $\exp(2t)$ and a slower growth function such as $\exp(t/c)$ for some $c > 1$.

Another commonly used network representation is the adjacency matrix $A$, which shows whether two counties are connected and is often constructed based on a similarity measure like Pearson correlation or a mutual information score. If the graph is undirected, where each edge that connects two nodes is bidirectional, $A$ is symmetric. The two most common types of similarity graphs are the $\varepsilon$-neighborhood graph and the $k$-nearest neighbor graph. As we are using the sample correlation as the similarity measure, the $\varepsilon$-neighborhood adjacency $A_\varepsilon$ is defined as follows:

$$(A_\varepsilon)_{ij} = \begin{cases} 1, & \text{if } R_{ij} \geq 1 - \varepsilon, \\ 0, & \text{otherwise}. \end{cases} \qquad (2)$$
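As a concrete illustration of equations (1) and (2), here is a minimal NumPy sketch (not the authors' code; the data layout, one array of log counts per county, is an assumption made here):

```python
import numpy as np

def pairwise_corr(logs):
    """Pearson correlation of each county pair over the overlapping
    window T_ij = min(T_i, T_j), as in equation (1).
    logs[i][t] is the log cumulative case count of county i on day t
    after registration (the day the county reached 12+ cases)."""
    n = len(logs)
    R = np.eye(n)
    for i in range(n):
        for j in range(i + 1, n):
            T = min(len(logs[i]), len(logs[j]))
            R[i, j] = R[j, i] = np.corrcoef(logs[i][:T], logs[j][:T])[0, 1]
    return R

def eps_adjacency(R, eps):
    """epsilon-neighborhood graph: connect i and j when R_ij >= 1 - eps."""
    A = (R >= 1 - eps).astype(int)
    np.fill_diagonal(A, 0)  # no self-loops
    return A
```

Note that each pairwise correlation uses only the overlapping registered window, so counties with different outbreak lengths remain comparable.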
A $k$-nearest neighbor adjacency $A_k$ is defined as follows:

$$(A_k)_{ij} = \begin{cases} 1, & \text{if county } i \text{ is among } j\text{'s } k \text{ nearest neighbors, or county } j \text{ is among } i\text{'s } k \text{ nearest neighbors}, \\ 0, & \text{otherwise}, \end{cases} \qquad (3)$$

where the nearest neighbors are found with respect to $R_{ij}$.

Depending on the parameters $\varepsilon$ and $k$ one chooses for $A_\varepsilon$ and $A_k$, respectively, a significant amount of information could be lost in the process because of the thresholding operation. However, this operation also filters out many spurious correlations. Unlike the sparse $A_\varepsilon$ and $A_k$, $R$ retains all of the pairwise similarities between counties, which sheds more light on the within-group and between-group relationships.

Stochastic Block Model (SBM)
The matrices $R$, $A_\varepsilon$, and $A_k$ are critical because they can help us recover $\Theta$, an $n \times K$ membership matrix that reflects which community each county belongs to, where $K$ is the number of communities. Letting $Z_i \in \{1, \ldots, K\}$ be the community that county $i$ belongs to, the $i$-th row of $\Theta$ has exactly one 1, in column $Z_i$ (the community that county $i$ belongs to), and 0 elsewhere. We estimate $\Theta$ under an SBM, where the probability that two counties are connected depends only on the memberships of these two counties. An SBM denoted by $G(n, B, \Theta)$ has $n$ nodes, $K$ communities, and is parameterized by $\Theta$ and $B$, the $K \times K$ symmetric connectivity matrix. Essentially, $B$ contains the inter- and intra-community connection probabilities: the probability of an edge between counties $i$ and $j$ is $B_{Z_i Z_j}$.

The objective is to obtain an accurate estimate $\hat{\Theta}$ of $\Theta$ from an observed adjacency matrix $A$ that is modeled as $G(n, B, \Theta)$. This yields a recovery of the partitions $G_k := \{i : Z_i = k,\ i = 1, \ldots, n\}$ by $\hat{G}_k = \{i : \hat{Z}_i = k,\ i = 1, \ldots, n\}$, $k = 1, \ldots, K$, up to a permutation of the cluster labels, where $\hat{Z}_i$ indicates the location of the 1 in the $i$-th row of $\hat{\Theta}$. The population matrix $P \in \mathbb{R}^{n \times n}$, where $P_{ij}$ is the probability that counties $i$ and $j$ are connected, is naturally expressed as $P = \Theta B \Theta^T$.
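To make the generative model concrete, here is a small simulation sketch (illustrative only; `sample_sbm` and the example $\Theta$ and $B$ are hypothetical, not taken from the paper's pipeline):

```python
import numpy as np

def sample_sbm(Theta, B, rng):
    """Draw a symmetric adjacency matrix A from an SBM G(n, B, Theta):
    the probability of an edge between i and j is P_ij = (Theta B Theta^T)_ij."""
    P = Theta @ B @ Theta.T                       # n x n edge probabilities
    n = P.shape[0]
    upper = np.triu(rng.random((n, n)) < P, k=1)  # sample each pair once
    A = upper.astype(int)
    return A + A.T                                # symmetric, zero diagonal

# two communities of three nodes each, dense within, sparse between
Theta = np.array([[1, 0]] * 3 + [[0, 1]] * 3)
B = np.array([[0.9, 0.1],
              [0.1, 0.9]])
A = sample_sbm(Theta, B, np.random.default_rng(0))
```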
Spectral Clustering

Spectral clustering has been a popular choice for community detection (Rohe et al., 2011; Jin, 2015a; Lei and Rinaldo, 2015). The central idea is to relate the eigenvectors of the observable adjacency matrix $A$ to those of $P = \Theta B \Theta^T$, which is not observed. This is accomplished by expressing $A$ as a perturbation of its expected value: $A = E[A] + (A - E[A])$. If we treat $E[A]$ as the signal part and $A - E[A]$ as the noise, we connect the eigenvectors of $A$ and $P$ using $E[A] = P - \mathrm{diag}(P)$. Noting that $\mathrm{rank}(P) = K$, and letting $U = [u_1, \ldots, u_K] \in \mathbb{R}^{n \times K}$ be the eigenspace spanned by the $K$ nonzero eigenvalues of $E[A]$, the columns of $U$ span the same linear space as the columns of $P$ (ignoring $\mathrm{diag}(P)$). Additionally, $P$ has the same column space as $\Theta$. Now, letting $\hat{U}$ be the eigenspace corresponding to the $K$ largest absolute eigenvalues of $A$, $\hat{U}$ is a consistent estimate of $U$, or the column space of $\Theta$, under some mild conditions. To resolve the ambiguity created by rotation, the $k$-means algorithm is applied to the normalized rows of $\hat{U}$ to identify the community memberships (Rohe et al., 2011; Lei and Rinaldo, 2015).

Instead of examining the eigenvalues of $A$, spectral graph theory has long studied graph Laplacian matrices as a tool for spectral clustering. The symmetric Laplacian matrix is defined as follows: letting $D = \mathrm{diag}(d_1, \ldots, d_n)$ be the diagonal degree matrix with $d_i = \sum_{j=1}^n A_{ij}$, a popular definition of a normalized, symmetric Laplacian matrix is $L = I - D^{-1/2} A D^{-1/2}$. When clustering with $L$, one takes the eigenvectors corresponding to the smallest eigenvalues in absolute value.

In our context, $A$ can be taken as either $A_\varepsilon$ or $A_k$ as outlined in subsection 2.1. As there are no exact rules for choosing the parameters $\varepsilon$ and $k$ of $A_\varepsilon$ and $A_k$, respectively, clustering with $L$, which depends on the adjacency matrix, may be less than ideal. It is also an added, often computationally cumbersome step.
Instead, we cluster directly on the similarity matrix $R$, the sample correlation matrix. Algorithm 1 delineates the detailed steps of this approach. The classic spectral clustering procedure with $L$, used as a benchmark, is outlined in the Supplementary Material.

There are several methods for choosing the number of spiked eigenvalues in the context of factor models: scree plot, eigen-gap, eigen-ratio, and adjusted correlation thresholding. As our method involves correlations, we apply the adjusted correlation thresholding method of Fan et al. (2020).

Algorithm 1: Spectral clustering on the correlation matrix

Input:
Sample correlation matrix $R \in \mathbb{R}^{n \times n}$ and the number of clusters $K$.
1. Compute the eigenvectors $u_1, \ldots, u_K$ of $R$ corresponding to the $K$ largest eigenvalues in absolute value, and let $\hat{U} \in \mathbb{R}^{n \times K}$ be the matrix with these eigenvectors as columns.
2. Normalize the rows of $\hat{U}$ to have unit norm to get $\hat{U}_{\mathrm{norm}}$.
3. Cluster the rows of $\hat{U}_{\mathrm{norm}}$ with $k$-means.
Return: Partition $\hat{G}_1, \ldots, \hat{G}_K$ of the nodes.

This method leads to $K = 2$, which roughly divides the counties into faster and slower growth communities. It also agrees with the choice that maximizes the eigen-gap.
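A NumPy-only sketch of Algorithm 1 (the k-means step here uses a plain Lloyd iteration with a deterministic farthest-point initialization; the authors' exact k-means implementation may differ):

```python
import numpy as np

def spectral_cluster_corr(R, K, n_iter=50):
    """Algorithm 1: spectral clustering directly on the sample correlation matrix."""
    vals, vecs = np.linalg.eigh(R)                    # R is symmetric
    top = np.argsort(np.abs(vals))[::-1][:K]          # K largest |eigenvalues|
    U = vecs[:, top]
    U = U / np.linalg.norm(U, axis=1, keepdims=True)  # unit-norm rows
    # farthest-point initialization, then plain Lloyd's k-means
    centers = [U[0]]
    for _ in range(1, K):
        d = np.min([((U - c) ** 2).sum(1) for c in centers], axis=0)
        centers.append(U[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(n_iter):
        labels = np.argmin(((U[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for k in range(K):
            if (labels == k).any():
                centers[k] = U[labels == k].mean(axis=0)
    return labels
```

On a correlation matrix with two clear blocks, the second eigenvector's signs essentially separate the groups, in line with the discussion of Abbe et al. (2020) in the text.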
Clustering procedure

Figure 2: The left panel shows the 10 largest eigenvalues of R. The right panel shows the same 10 eigenvalues, zoomed in on the y-axis.

To compare and visualize the eigenvalues of $R$, Figure 2's left panel displays how dominant the first eigenvalue is compared to the rest.

Figure 3 is a visualization of the first two eigenvectors of $R$ and the linear separation into which the algorithm partitioned all the counties. The left panel is with unit-norm normalization and the right is without the normalization. The result of essentially using the signs of the components in the second eigenvector to cluster is reminiscent of the work by Abbe et al. (2020), which has strong theoretical support. From now on, all clustering analysis will be based on the unit-norm normalization of the eigenvectors.

Fastest and Slowest Growth Clusters

Figure 3: The left panel depicts the first two normalized eigenvectors of R and the corresponding clusters, Group 1 in blue and Group 2 in purple. The right panel depicts the same two clusters in the two un-normalized eigenvectors.

For future analysis (section 2.7), it is useful to define the clusters that contain the counties with the fastest and slowest growth. After the clusters are produced with Algorithm 1, for every community $k$ we calculate the average exponential growth rate of the counties in that community. This is done by fitting the total number of cases of each county $i$ on day $t$, $x_{i,t}$, to

$$x_{i,t} = x_{i,0}(1 + r_i)^t + \varepsilon_{i,t}$$

through nonlinear least squares and obtaining the approximated growth rate $r_i$ for county $i$. Then, we compare the average fitted growth rate $\hat{r}_k = \frac{1}{|\hat{G}_k|} \sum_{i \in \hat{G}_k} r_i$ and its standard error for clusters $k = 1, \ldots, K$. The fastest growth cluster is defined as $\mathrm{argmax}_k\, \hat{r}_k$ and the slowest growth cluster as $\mathrm{argmin}_k\, \hat{r}_k$.
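The growth-rate fit can be sketched with SciPy's nonlinear least squares; the helper names and the initial guess are assumptions made here, not the authors' code:

```python
import numpy as np
from scipy.optimize import curve_fit

def fit_growth_rate(x):
    """Fit x_t = x_0 * (1 + r)^t by nonlinear least squares and return r."""
    t = np.arange(len(x), dtype=float)
    model = lambda tt, x0, r: x0 * (1.0 + r) ** tt
    (x0_hat, r_hat), _ = curve_fit(model, t, np.asarray(x, float), p0=(x[0], 0.1))
    return r_hat

def cluster_growth_rate(xs, labels, k):
    """Average fitted growth rate over counties assigned to cluster k."""
    rates = [fit_growth_rate(x) for x, lab in zip(xs, labels) if lab == k]
    return float(np.mean(rates))
```

The fastest and slowest growth clusters are then the argmax and argmin of these cluster averages.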
Data

We use the COVID-19 (2019-nCoV) Data Repository by the Johns Hopkins Center for Systems Science and Engineering (CSSE), which contains data on the number of confirmed cases and deaths in the United States and around the world, broken down by counties in the U.S. The public database is updated daily and the virtual dashboard is also used widely around the world. Data sources of the database include the World Health Organization (WHO), US Centers for Disease Control (CDC), BNO News, WorldoMeters, and 1point3acres. We take all counties that have 12 or more cumulative cases in the time frame of 01/22/2020 to 04/17/2020. We treat the day a county reaches 12 or more confirmed cases as day one and
                 | Group 1                             | Group 2
Model            | No. of Counties  Growth Rate  SE    | No. of Counties  Growth Rate  SE
R                | 467              0.1589       0.0020 | 483             0.1704       0.0019
A_k              | 462              0.1583       0.0020 | 488             0.1677       0.0019
A_ε              | 470              0.1605       0.0020 | 470             0.1664       0.0020
R, second phase  | 487              0.0207       0.0005 | 463             0.0233       0.0005
Table 2: Average growth rates and the number of counties in each cluster for K = 2. Model R corresponds to Algorithm 1, where we use the sample correlation matrix. Model A_k corresponds to Algorithm 4, where we use the k-nearest neighbors graph (k = 7). Model A_ε corresponds to Algorithm 4, where we use the ε-neighborhood graph. Groups 1 and 2 are the obtained partitions $\hat{G}_1$ and $\hat{G}_2$, respectively. Growth Rate is the approximated exponential growth rate, calculated as in subsection 2.5. Presented are the averages of these growth rates and their associated SEs for the counties in the two groups, clustered by the different methods. R, second phase is for the clusters obtained for the period 05/10/2020 - 07/10/2020.

          Group 1   Group 2
Group 1   96.1%     94.0%
Group 2   94.0%     96.8%

Table 3: R average block correlations, K = 2.

then discard all counties that have a time series of fewer than 14 days after processing. This way we shift each county to a similar starting point in terms of the number of cases, with a long enough period to do a meaningful analysis. We also remove unassigned cases and U.S. territories, which ultimately results in a total of 950 counties. As mentioned before, we use $w_{i,t} = \log(x_{i,t})$ to represent the logarithmic cumulative cases for county $i$ on day $t$.

We also repeat the community detection process with more recent data from 05/10/2020, when many states started to reopen, to 07/10/2020. The bulk of this part of the study concentrates on the beginning phase of the pandemic, given that health and government intervention to minimize the number of future cases should be executed as early as possible. However, we compare the resulting communities with more recent data that captures the second phase of the pandemic in the U.S. States experienced a significant drop in cases when lockdowns were enforced and businesses were closed, but as they began to reopen, the number of cases saw an uptick once again.
Since this second phase comes months after the initial outbreak, there may be meaningful differences worthy of analysis.

Results and Discussion
Figure 4: Heatmap of the block correlations of R, K = 2. Model R corresponds to Algorithm 1, where we use the sample correlation matrix. Groups 1 and 2 are the obtained partitions $\hat{G}_1$ and $\hat{G}_2$, respectively.

We can see from Table 2 that for the clusters obtained by Algorithm 1 (R), the difference between the growth rates of Group 1 and Group 2 is the largest. This is validated by the discernible difference in trajectories shown in Figures 5 and 6. The standard error bands in Figure 5 underscore that the two groups become more distinct in their growth trajectories as time goes on: the overlap between the bands of the two groups decreases over time. For $A_k$ and $A_\varepsilon$, the growth rates are much closer together between the two communities. Furthermore, the right panel of Figure 6 is a plot of the average cases for the period after community detection was performed: 04/17/2020 - 09/03/2020. Evidently, the separation between the two groups becomes much more distinct as time goes on (much larger number of cases). As for community detection of the subsequent phase of the pandemic in the U.S. (05/10/2020 - 07/10/2020), the last row of Table 2 again shows a larger average growth rate for Group 2, albeit much smaller in magnitude, since cases increased at a slower rate once the country learned how to deal with the pandemic.

Table 3 contains information on the average intra- and inter-group correlations, a sample reflection of $B$. Evidently, the intra-community correlations are higher than the inter-community correlations. Group 1's intra-correlation of 96.1% and Group 2's 96.8% are greater than 94.0%, the inter-group correlation between the two groups. As we only took counties with significant outbreaks as of 04/17/2020, it is logical to observe high correlations across the board.
These results are also mirrored in Figure 4, a heatmap of the block correlations. Some notable counties that are partitioned into Group 2, the fast growth community, include Los Angeles, CA; San Francisco, CA; District of Columbia; DeKalb, GA; Fulton, GA; Miami-Dade, FL; Cook, IL; Jefferson, LA; Suffolk, MA; Bergen, NJ; New York, NY; Westchester, NY; and King, WA, all large epicenters. Figure 7 is a geographical visualization of the communities.

In addition, Figure 8 shows the same plots as those in Figure 6 but for a later phase. The curves are clearly much flatter in both groups, which is likely due to the increase in the number of cases plateauing in many counties. Furthermore, the distinction between the curves of Groups 1 and 2 is also considerably bigger than in the earlier data. This can be explained by the confluence of additional factors that separate each county's experience with the virus, including the nature of local government intervention, the degree and timing of re-openings, travel restrictions, etc.

Figure 5: The left panel represents the average cumulative number of cases of the initial phase 01/22/2020 - 04/17/2020 with one-standard-error bands for the clusters of R, K = 2. The right panel is the average log cumulative number of cases of R, K = 2. The x-axis is in calendar time, which does not account for heterogeneous starting times of the outbreak in each county.

Figure 6: The left panel represents the average cumulative number of cases of the initial phase 01/22/2020 - 04/17/2020, starting from the first day of at least 12 cases, for the clusters of R, K = 2. The right panel is the average cumulative number of cases of the period 04/18/2020 - 09/03/2020, the time frame after the initial phase used in community detection. The x-axis here accounts for the heterogeneity of the outbreak of COVID-19 in each county.

Figure 7: Clusters for model R of the initial phase 01/22/2020 - 04/17/2020. Model R corresponds to Algorithm 1, where we use the sample correlation matrix. Groups 1 and 2 are the obtained partitions $\hat{G}_1$ and $\hat{G}_2$, respectively.

Figure 8: Growth curves of clusters obtained from community detection on data from 05/10/2020 - 07/10/2020 (recent phase). Plots are the same as those of Figure 6.
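The block averages reported in Table 3 are simple means of R over the estimated groups; a sketch of that computation (illustrative only, assuming cluster labels from Algorithm 1):

```python
import numpy as np

def block_correlations(R, labels):
    """Average intra- and inter-group correlations: a K x K sample analogue of B."""
    groups = np.unique(labels)
    K = len(groups)
    C = np.zeros((K, K))
    for a, ga in enumerate(groups):
        for b, gb in enumerate(groups):
            ia = np.where(labels == ga)[0]
            ib = np.where(labels == gb)[0]
            block = R[np.ix_(ia, ib)]
            if a == b:
                mask = ~np.eye(len(ia), dtype=bool)  # drop self-correlations of 1
                C[a, b] = block[mask].mean()
            else:
                C[a, b] = block.mean()
    return C
```

Excluding the diagonal in the intra-group blocks avoids inflating the averages with each county's trivial self-correlation of 1.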
An important and subsequent question that arises once the communities are obtained is what underlying factors play a role in which growth cluster a county belongs to. Since the growth of COVID-19 cases is also related to static, inherent factors that are not a consequence of the disease, we examine a variety of county demographic variables and how they differ among communities. In order to select the variables that are most statistically significant, or most relevant to the community assignment of a county, we perform independent two-sample t-tests on the fastest and slowest growth groups (subsection 2.5) with respect to various demographic variables. The null and alternative hypotheses of this t-test for the $d$-th feature are as follows:

$$H_0: \mu_{d,1} = \mu_{d,2} \quad \text{vs.} \quad H_a: \mu_{d,1} \neq \mu_{d,2}, \qquad (4)$$

where $\mu_{d,1}$ is the mean value of the $d$-th feature of the fastest growth cluster and $\mu_{d,2}$ is the mean value of the $d$-th feature of the slowest growth cluster. We then compute the two-sample test statistic with a pooled estimate of the variance. After finding the p-values, we rank the features from lowest p-value (most significantly different between the two groups) to highest (least significantly different between the two groups).

Furthermore, we repeat Algorithm 1 for larger numbers of communities ($K = 3$ and above), select the 'fastest' and 'slowest' growth clusters in each case, and carry out the independent two-sample t-tests as described above for the same demographic features. This sensitivity analysis tests whether the demographic variables that are the most significantly different between the two groups are consistent when we have a larger number of communities. Ultimately, we present the most statistically significant demographic features.
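The pooled two-sample t-test and the p-value ranking can be sketched with SciPy (a minimal illustration; the function and argument names are assumptions made here, and the feature matrices are hypothetical):

```python
import numpy as np
from scipy import stats

def rank_features(X_fast, X_slow, names):
    """Pooled (equal-variance) two-sample t-test for each feature, as in (4);
    return (name, p-value) pairs sorted from most to least significant.
    X_fast, X_slow: (counties x features) arrays for the fastest and
    slowest growth clusters; names: list of feature names."""
    _, p = stats.ttest_ind(X_fast, X_slow, axis=0, equal_var=True)
    order = np.argsort(p)
    return [(names[j], float(p[j])) for j in order]
```

`equal_var=True` selects the classical pooled-variance statistic described in the text, rather than Welch's unequal-variance variant.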
Data

For this section, we use data from Data Planet, a social science research database that compiles 12.6 billion U.S. and international datasets from over 80 sources. For our purposes, we look at the 2017 American Community Survey (ACS), the largest household survey in the U.S., conducted by the U.S. Census Bureau. We select 17 relevant features on a county level, which are displayed and summarized in Table 4. Note that not all counties from the Johns Hopkins CSSE data that were used in subsection 2.6 are available on Data Planet, so the analysis in this section is done on 633 counties. The resulting counties remain closely split between Group 1 and Group 2, similar to the split of R seen in Table 2.

Figure 9: The left panel is a geographical representation of counties according to median household income. Blue dots are counties with less than $50,000 median annual household income and purple dots are counties with more than $50,000 median annual household income. The right panel is a geographical representation of counties according to population density. Blue dots are counties with less than 150 persons per sq mile and purple dots are counties with more than 150 persons per sq mile.

Figure 10: Comparisons of the averages of a few features of the 2-clustered counties. The orange bars are model R, the gray bars are model A_k, and the blue bars are model A_ε.
                                       Group 1                      Group 2
Feature                                Mean     Median   Std Dev    Mean     Median   Std Dev
Population Density                     275.775  159.260  394.059    913.182  289.610  3659.76
Median Household Income                54431.6  52651.0  13386.3    58814.6  56074.0  16737.2
% Poverty                              14.0656  13.3000  5.71511    13.4712  12.5000  5.83384
% 1-person households
% 5 or more person households
% households w 60 y/o and older        39.3294  39.1802  6.85032    38.7538  38.6947  6.03233
% w low access to stores               21.8723  21.3800  9.66507    20.8704  21.2500  9.81211
% low income w low access to stores    7.48273  6.85500  4.48466    6.60216  5.69000  4.52299
% households w low access to stores    2.69416  2.32000  1.63712    2.24196  1.92000  1.50078
25 y/o and older w Bachelor's /1,000   110.885  106.164  40.6371    122.654  118.332  43.6276
% White                                80.8599  85.6959  15.0719    75.7593  79.6171  16.7522
% Black                                11.3861  5.03010  14.6140    13.4737  7.90300  15.3919
% Asian                                1.96190  1.22070  1.89220    3.67570  1.87900  5.28820
No of bars                             29.3313  16.0000  39.0755    55.2143  23.5000  96.9646
No of grocery stores                   42.5564  23.0000  67.0810    117.321  39.0000  16.5490
No of restaurants                      13.1345  8.00000  13.1383    14.9219  9.00000  16.5490
% take public transportation           0.41130  0.19870  0.76870    1.24690  0.32130  6.14170

Table 4: R clusters' mean and median values for selected features for each community, K = 2. Model R corresponds to Algorithm 1, where we use the sample correlation matrix. Groups 1 and 2 are the obtained partitions $\hat{G}_1$ and $\hat{G}_2$, respectively. Population Density is the number of people per sq mile; Median Household Income is in US dollars; % Poverty is the poverty rate; % 1-person households is the percentage of one-person households; % 5 or more person households is the percentage of five-or-more-person households; % households w 60 y/o and older is the percentage of households that have one or more members who are 60 years old or older; low access to stores is defined as living more than one mile (urban areas) or 10 miles (rural areas) from the nearest supermarket, supercenter, or large grocery store; /1,000 is per 1,000 persons; % take public transportation is the percentage of all persons who work in a county and take public transportation to work every day. All feature information is as of 2017.

Results and Discussion
It is evident from Table 4 that community detection with R results in Group 2 (fast growth) containing the counties with by far the highest mean and median population density. The mean and median household income are also higher for counties in Group 2. The mean number of persons 25 years old or older with a Bachelor's degree is noticeably greater for Group 2, which often coincides with more urban areas that are more densely populated. However, it can also be related to the number of universities in a particular area, as a higher number would exacerbate the spread of COVID-19. The numbers of bars and grocery stores are also starkly different between the two groups. Moreover, the percentage of people who take public transportation to work is markedly higher for Group 2.
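The per-community summaries reported in Table 4 are plain descriptive statistics over each recovered group. A minimal sketch with numpy (the density values and labels below are toy stand-ins, not the paper's county data):

```python
import numpy as np

# Toy population-density values and community labels from a clustering step.
density = np.array([150.0, 210.0, 320.0, 900.0, 2800.0, 95.0])
labels = np.array([1, 1, 1, 2, 2, 1])

def summarize(values, labels, group):
    """Mean, median, and sample standard deviation for one community."""
    v = values[labels == group]
    return v.mean(), np.median(v), v.std(ddof=1)

m1 = summarize(density, labels, 1)   # statistics for Group 1
m2 = summarize(density, labels, 2)   # statistics for Group 2
```

Repeating this for each of the 17 features yields a table of the Table 4 form.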
[Table 5. Left table (early-phase R clusters), features with the smallest p-values, include: % Asian, No of grocery stores, No of bars, Median Household Income, % White, Population Density, and 25 y/o and older w Bachelor's /1,000. Right table (recent data), features with the smallest p-values, include: No of grocery stores, % low income w low access to stores, Median Household Income, % Poverty, % White, % 1-person households, and Population Density.]
Table 5: The left table gives the R clusters' p-values for independent two-sample t-tests for selected features between Group 1 and Group 2, sorted from smallest to largest p-value. The right table gives the R clusters' p-values for the recent data (05/10/2020 - 07/10/2020). The features in bold are the ones selected as significant features for further analysis in Section 4.
Feature                                P-Value     Feature                                P-Value
% w low access to stores               0.02683     % White                                0.09115
% low income w low access to store     0.08490     % Poverty                              0.09423
No of bars                             0.12402     % households w low access to stores    0.12762
No of restaurants                      0.13868     Median Household Income                0.16226
No of grocery stores                   0.18422     Population Density                     0.17828
% Black                                0.21146     25 y/o and older w Bachelor's /1,000   0.32282
% of households w 60 y/o and older     0.22823     % low income w low access to stores    0.36051
Population Density                     0.27201     No of restaurants                      0.39537
% 5 or more persons households         0.34981     % take public transportation           0.50781
% take public transportation           0.38932     % Asian                                0.51744
% 1-person households                  0.52795     % 1-person households                  0.41733
% households w low access to stores    0.46620     % Black                                0.56017
% Poverty                              0.63207     % households w 60 y/o and older        0.57800
% White                                0.47153     No of grocery stores                   0.72219
Median Household Income                0.67262     % w low access to stores               0.78404
25 y/o and older w Bachelor's /1,000   0.71546     % 5 or more persons households         0.91679
% Asian                                0.92895     No of bars                             0.93340
Table 6: The left and right panels give the two A clusterings' p-values for independent two-sample t-tests for selected features between Group 1 and Group 2, each sorted from smallest to largest p-value.

The group differences are far less pronounced for the two A clusterings (see Supplementary Material for Tables 10 and 11); in fact, the differences in means and medians between the two groups are significantly smaller for almost all features. For instance, the differences between the population density means for the two groups of the two A clusterings are 191.609 and 228.966 people per sq mile, respectively, much smaller differences than that of R's clusters: 637.407. Furthermore, these trends are not consistent across features; for example, one A clustering's Group 2 has the higher population density and median income averages but the lower average numbers of bars, grocery stores, and restaurants. On the other hand, contrary to what one would expect from the relationship between the number of one-person households and the spread of COVID-19, there is little differentiation by household size. Figure 10 displays individual bar plots of average values for six features for R and the two A clusterings. The orange bars of R's clusters are plainly contrasted between Groups 1 and 2, unlike those of the A clusterings; it is very likely that too much information was lost during the truncation process that produces A. Tables 5 and 6 contain the p-values for all 17 features. Evidently, the clusters formed from A do not exhibit differences in demographic variables as significant as R's, as their p-values are much higher across the board. For R, the variables have much greater explanatory power: the numbers of grocery stores and bars have much lower p-values than the number of restaurants.
Also, as expected, the median household income is among the features with lower p-values; however, population density does not appear as significant as expected. After conducting the same two-sided t-tests for larger K on the two extreme groups, the seven statistically significant features found are as follows: population density, median income, number of persons 25 years and older with a Bachelor's degree per 1,000 persons, percentage of the White population, percentage of the Asian population, number of bars, and number of grocery stores. These seven features are consistently ranked in the top eight features, based on p-values, for each K considered. These values form the demographic vector d_i for each county i. The variables for bars and grocery stores underscore the ease of transmission in locations with greater numbers of public gathering spots, a characteristic evident in cities like New York City, where many people chose to convene at bars without much social distancing (before stricter lockdowns took place). After finding the growth communities and conducting t-tests to ascertain the significant features for the latter phase of the pandemic in the U.S. (05/10/2020 - 07/10/2020), the features with the lowest p-values diverge from those of the earlier data, as presented in the right table of Table 5. Population density and median income remain among the most meaningful, with income being more significant, but variables such as the number of grocery stores, the percentage of people with low access to stores, and the percentage living in poverty have become much more significant. This suggests that at later stages of the pandemic, poverty and other income-related measures become more indicative of, and responsible for, the differences in case growth among counties.
Thus, the seven features for d_i for this latter phase are the top seven variables in the right table of Table 5: number of grocery stores, % low income with low access to stores, median household income, % poverty, % White, population density, and % 1-person households.

The final section of our COVID-19 methodology is to predict a county's growth trajectory a few days into the future. We propose a prediction methodology with the objective that, given a new county with its key demographic features and social distancing measures, we implement an algorithm that projects the new county's future growth. Before going in-depth on the prediction models, it is necessary to first define some important variables. Let l be the number of days forward to be projected for a new county. To build such a predictive model, let y_{i,t+l} = log(x_{i,t+l}) − log(x_{i,t}) be county i's l-day forward log-growth rate, which is close to the growth rate (x_{i,t+l} − x_{i,t}) / x_{i,t} by Taylor expansion and numerical verification, for t = 1, ..., T_i. Here, T_i + l is the total number of days on which county i has 12 or more cases. Recall the partitions obtained from Algorithm 1 (the set of indices of counties that belong to group k): Ĝ_k = {i | Ẑ_i = k, i = 1, ..., n}, where Ẑ ∈ R^n is the recovered community label vector. For a community k and a county i ∈ Ĝ_k, let d_i ∈ R^q be county i's significant feature vector obtained from Section 2.7, let S_i = [s_{i1}, s_{i2}, ..., s_{iT_i}]^T ∈ R^{T_i×3} be county i's three social distancing time series (see Subsection 4.4 for details about these data), and let y_i = [y_{i,1+l}, ..., y_{i,T_i+l}]^T ∈ R^{T_i} be its l-day forward log case differences. Note that each row of S_i, s_{it} ∈ R^3, contains the three social distancing metrics at time t. In summary, we have the data {S_i, y_i : i ∈ Ĝ_k} for training an l-day-ahead predictive model for the k-th community.

Long Short-Term Memory (LSTM)
To enhance the effectiveness of the model, we take advantage of a special type of recurrent neural network (RNN): long short-term memory (LSTM) networks, which are well suited to time-series forecasting. Unlike feedforward neural networks (FNNs), RNNs produce an output that depends on a "hidden" state vector containing information from prior inputs and outputs. An LSTM builds on a simple, vanilla RNN by adding a forget gate, an input gate, and an output gate to each module; hence it is able to "remember" information over longer time periods (lags). The output of an LSTM module at time t is

h_t = o_t ∘ tanh(C_t).   (5)

The components of h_t are broken down as follows. The forget gate output is f_t = σ(W_f [h_{t−1}, x_t] + b_f), where W_f and b_f are its weights and biases, respectively. The input gate output is i_t = σ(W_i [h_{t−1}, x_t] + b_i), where W_i and b_i are its weights and biases, respectively. The cell state vector then gets updated by forgetting the previous memory through the forget gate and adding new memory through the input gate: C_t = f_t ∘ C_{t−1} + i_t ∘ C̃_t, where C̃_t = tanh(W_C [h_{t−1}, x_t] + b_C). Subsequently, the output gate is o_t = σ(W_o [h_{t−1}, x_t] + b_o), where W_o and b_o are its weights and biases, respectively. Here, σ is the sigmoid activation function and ∘ denotes elementwise multiplication.

We also compare the LSTM's performance with that of an FNN, namely an MLP (multilayer perceptron). MLPs are a type of fully connected FNN first introduced and popularized by Rumelhart et al. (1986), consisting of an input layer, an output layer, and hidden layers in between, where the training is done through backpropagation. The total input x_i^{s+1} of a neuron i of layer s+1 takes the form

x_i^{s+1} = Σ_j h_{ij}^s x_j^{sσ} + b_i^{s+1},   (6)

where h_{ij}^s is the weight from neuron j of the previous layer s to neuron i of layer s+1, and b_i^{s+1} is the threshold of layer s+1.
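A single LSTM update following the gate equations above can be sketched in numpy; the weight matrices here are small random placeholders, not fitted values:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W, b):
    """One LSTM module update: forget, input, and output gates, then eq. (5)."""
    z = np.concatenate([h_prev, x_t])           # [h_{t-1}, x_t]
    f_t = sigmoid(W["f"] @ z + b["f"])          # forget gate
    i_t = sigmoid(W["i"] @ z + b["i"])          # input gate
    C_tilde = np.tanh(W["C"] @ z + b["C"])      # candidate memory
    C_t = f_t * C_prev + i_t * C_tilde          # cell state update
    o_t = sigmoid(W["o"] @ z + b["o"])          # output gate
    h_t = o_t * np.tanh(C_t)                    # eq. (5)
    return h_t, C_t

rng = np.random.default_rng(0)
n_in, n_hid = 3, 10                             # 3 social distancing metrics as inputs
W = {k: rng.normal(scale=0.1, size=(n_hid, n_hid + n_in)) for k in "fiCo"}
b = {k: np.zeros(n_hid) for k in "fiCo"}
h, C = np.zeros(n_hid), np.zeros(n_hid)
for x_t in rng.normal(size=(5, n_in)):          # run 5 time steps
    h, C = lstm_step(x_t, h, C, W, b)
```

Since h_t = o_t ∘ tanh(C_t) with o_t in (0, 1), every component of the hidden state stays inside (−1, 1).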
Here, x_j^{sσ} = σ(x_j^s) is the output of neuron j of the previous layer s, where a nonlinear activation function σ(·) is applied to its input. The most common activation functions are the sigmoid, tanh, and ReLU (rectified linear unit); the ReLU often learns faster in deeper networks.

Prediction Models
The first prediction model, Algorithm 2 (which we will refer to as SD-LSTM), is a prediction procedure that uses a nonlinear model (a neural network) alone to fit the data. The idea is to first train an LSTM for each of the K communities; then, given a new county, we select the corresponding fitted model for prediction from our repertoire according to its nearest-neighbor county (in demographic variables, not geographical distance). That is, we apply the nearest neighbor in demographic variables to classify the new county's community, and use the model for that community to forecast the county's cases. Specifically, for each community k ∈ {1, ..., K}, we train an LSTM with the data {(s_{it}, y_{i,t+l})_{t=1}^{T_i}, for all i ∈ Ĝ_k}, and this depends on the number of steps forward, l, we are trying to forecast. For simplicity of notation, for community k we denote all such data items over all counties i ∈ Ĝ_k by {(s_t, y_{t+l}), t ∈ Ĝ_k^l} and the fitted function by f̂_k^l(·). Now for the second part, the prediction: given a new county i''s demographic data d_{i'} and social distancing information S_{i'} = [s_{i'1}, s_{i'2}, ..., s_{i'T_{i'}}] ∈ R^{T_{i'}×3}, we first find its nearest-neighbor county j = argmin_j ||d_{i'} − d_j|| and its associated community Ẑ_j, and use that community's prediction model to obtain ŷ_{i',t+l} = f̂_{k'}^l(s_{i't}), t = 1, ..., T_{i'}, with k' = Ẑ_j. Algorithm 2 summarizes this method of prediction. To predict a future event, the above procedure gives a number of prediction methods. For example, to predict tomorrow's outcome, we can use today's social distancing data with l = 1, or yesterday's social distancing data with l = 2, or the day before yesterday's social distancing data with l = 3, and so on.
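The prediction step of Algorithm 2, classifying a new county by its nearest demographic neighbor and then dispatching to that community's fitted model, can be sketched as follows. The demographic vectors and the per-community "models" here are toy stand-ins for the fitted LSTMs f̂_k^l(·):

```python
import numpy as np

def nearest_neighbor_community(d_new, D_train, Z_hat):
    """Return the community label of the training county closest to d_new."""
    j = np.argmin(np.linalg.norm(D_train - d_new, axis=1))
    return Z_hat[j]

# Toy setup: n = 4 training counties, q = 2 demographic features, K = 2 communities.
D_train = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.8]])
Z_hat = np.array([1, 1, 2, 2])
# Stand-ins for the fitted per-community models.
models = {1: lambda s: 0.1 * s.sum(), 2: lambda s: 0.5 * s.sum()}

d_new = np.array([4.9, 5.1])                           # new county's demographics
k = nearest_neighbor_community(d_new, D_train, Z_hat)  # classify the new county
S_new = np.array([[1.0, 0.5, 0.2], [0.8, 0.4, 0.1]])   # its social distancing rows
y_hat = [models[k](s_t) for s_t in S_new]              # forecast with that model
```

The same dispatch pattern carries over to Algorithm 3, where the selected model additionally includes the neighbor's linear fit.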
As verified later in Figure 12, it turns out that l = 4 is the best choice of lead, which aligns with the incubation period of the disease.

Algorithm 2
SD-LSTM: LSTM Prediction

Part I: Training
Input: the lead l.
for k ∈ {1, ..., K} do
    Train LSTM f̂_k^l(·) using the data {(s_t, y_{t+l}), t ∈ Ĝ_k^l}.
end for
return fitted LSTMs f̂_k^l(·), k = 1, ..., K.

Part II: Prediction
Input: a new county i', d_{i'} ∈ R^q, S_{i'} = [s_{i'1}, s_{i'2}, ..., s_{i'T_{i'}}] ∈ R^{T_{i'}×3}, Ẑ, and f̂_k^l(·), k = 1, ..., K from Part I.
Find county i''s nearest neighbor j = argmin_j ||d_{i'} − d_j||.
Select f̂_{k'}^l(·), where k' = Ẑ_j.
for t ∈ {1, ..., T_{i'}} do
    ŷ_{i',t+l} = f̂_{k'}^l(s_{i't}).
end for
return ŷ_{i'} = [ŷ_{i',1+l}, ŷ_{i',2+l}, ..., ŷ_{i',T_{i'}+l}]^T ∈ R^{T_{i'}}.

Algorithm 3 takes SD-LSTM a step further by including a linear component: a linear model is first fit for each county, and the residuals within each community are then further modeled by an LSTM. This idea is related to boosting, or nonparametric estimation using a parametric start (Fan et al., 2009), resulting in a semi-parametric fit. Again, the objective of the training part is to obtain K fitted models, one for each community, using semi-parametric regression techniques. More specifically, for county i with lead l, we first fit the linear regression model

y_{i,t+l} = α_i^l + (s_{it})^T β_i^l + ε_{i,t+l},  t = 1, ..., T_i.   (7)

After fitting the linear regression models for every county i ∈ Ĝ_k^l, we obtain the residuals {ε̂_{i,t+l}, t ∈ Ĝ_k^l} and save all the coefficients α_i^l, β_i^l for i ∈ Ĝ_k^l, k = 1, ..., K.
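The per-county linear fits in (7) can be illustrated with ordinary least squares via numpy; the social distancing matrix and growth rates below are synthetic, and the residual learner (the LSTM in the paper) is left abstract:

```python
import numpy as np

def fit_linear(S, y):
    """OLS fit of y_{t+l} = alpha + s_t^T beta + eps; returns (alpha, beta, residuals)."""
    X = np.column_stack([np.ones(len(S)), S])   # intercept plus 3 SD metrics
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return coef[0], coef[1:], resid

rng = np.random.default_rng(1)
S = rng.normal(size=(30, 3))                    # 30 days, 3 social distancing metrics
beta_true = np.array([0.5, -0.2, 0.1])
y = 0.3 + S @ beta_true + 0.01 * rng.normal(size=30)
alpha, beta, resid = fit_linear(S, y)
# The residuals would next be pooled within each community and fed to an LSTM.
```

With an intercept in the design matrix, the OLS residuals average to zero, so the community-level LSTM models only the structure the linear part misses.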
We then extract further information from {(s_t, ε̂_{t+l}), t ∈ Ĝ_k^l} by fitting an LSTM, obtaining the fitted ĝ_k^l(·). Then, for the prediction of a new county i', we follow the same steps as in SD-LSTM, but the final prediction instead adds the linear fit of the nearest-neighbor county to the LSTM fit of the community corresponding to that nearest neighbor: ŷ_{i',t+l} = α_j^l + (s_{i't})^T β_j^l + ĝ_{k'}^l(s_{i't}), where k' = Ẑ_j. We will refer to this model as SD-SP; the idea is summarized in Algorithm 3. We also include three other algorithms for comparison purposes. The first replaces the LSTM fit f̂_k^l(·) of community k in SD-LSTM with a linear model; this corresponds to fitting (7) without further boosting by an LSTM. For simplicity, we shall refer to this approach as SD-LM (social distancing linear model). The second uses both demographic and social distancing data to fit an LSTM. This approach is identical to SD-LSTM but includes the q = 7 significant demographic variables in Table 5 in addition to the three social distancing variables; we shall refer to it as DSD-LSTM (demographic and social distancing LSTM). DSD-LSTM is expected to improve the performance of Algorithm 2 due to the additional information from the demographic variables. The final model is similar

Algorithm 3 SD-SP: Semi-parametric Prediction
Part I: Training
Input: the lead l.
for k ∈ {1, ..., K} do
    Fit the regression models (7) for i ∈ Ĝ_k^l and obtain the residuals {ε̂_{t+l}, t ∈ Ĝ_k^l}.
    Train an LSTM using {(s_t, ε̂_{t+l}), t ∈ Ĝ_k^l}.
end for
return fitted LSTMs ĝ_k^l(·) and all α_i^l and β_i^l for i ∈ Ĝ_k^l, k = 1, ..., K.

Part II: Prediction
Input: a new county i', d_{i'} ∈ R^q, S_{i'} = [s_{i'1}, s_{i'2}, ..., s_{i'T_{i'}}] ∈ R^{T_{i'}×3}, Ẑ, and ĝ_k^l(·), α_i^l, β_i^l for i ∈ Ĝ_k^l, k = 1, ..., K from Part I.
Find county i''s nearest neighbor j = argmin_j ||d_{i'} − d_j||.
Select α_j^l and β_j^l for county j.
Select ĝ_{k'}^l(·), where k' = Ẑ_j.
for t ∈ {1, ..., T_{i'}} do
    ŷ_{i',t+l} = α_j^l + (s_{i't})^T β_j^l + ĝ_{k'}^l(s_{i't}).
end for
return ŷ_{i'} = [ŷ_{i',1+l}, ŷ_{i',2+l}, ..., ŷ_{i',T_{i'}+l}]^T ∈ R^{T_{i'}}.

to SD-LSTM but, instead of an LSTM, uses an MLP with two hidden layers (we will refer to this model as SD-MLP).

Implementation
For the LSTM, the optimization algorithm used is Adam with a learning rate of 0.01. We also test the performance of various leads to see which yields the highest out-of-sample R², defined as follows for a given new county i' and lead l:

R² = 1 − [ Σ_{t=1}^{T_{i'}} (y_{i',t+l} − ŷ_{i',t+l})² ] / [ Σ_{t=1}^{T_{i'}} (y_{i',t+l} − ȳ_{i'})² ],   (8)

where y_{i',t+l} is the observed value, ŷ_{i',t+l} is the predicted value, and ȳ_{i'} = (1/T_{i'}) Σ_{t=1}^{T_{i'}} y_{i',t+l} serves as the baseline predictor. The average, median, and standard deviation of the R² values are then taken across all counties in the testing sample. Additionally, for any model involving an LSTM, each county's series is truncated to the minimum length T̃ = min_{i=1,...,N} T_i, since the LSTM needs each sample to have a uniform number of time steps; therefore T_i = T̃ for each county i in the case of SD-LSTM, SD-SP, and the DSD-LSTM model. For information regarding the hidden layers used and the input shapes in the neural network models, see Table 7.

Model      No. of hidden layers   Type and no. of nodes   Input shape
SD-LSTM    1                      LSTM, 10                N × T̃ × 3
SD-SP      1                      LSTM, 10                N × T̃ × 3
DSD-LSTM   1                      LSTM, 10                N × T̃ × 10
SD-MLP     2                      Dense, 10, 10           (Σ_{i=1}^N T_i) × 3
Table 7: Number of hidden layers, type and number of nodes of each hidden layer, and input shape of each model that contains a neural network.
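The out-of-sample R² criterion in (8) measures a model against the county's own mean growth as the baseline, so a model worse than a constant fit yields a negative value (as seen for several models in Table 8). A direct implementation with toy values:

```python
import numpy as np

def out_of_sample_r2(y_true, y_pred):
    """1 - SSE(model) / SSE(mean baseline), as in equation (8)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    baseline = y_true.mean()                    # the baseline predictor y-bar
    sse = np.sum((y_true - y_pred) ** 2)        # model squared error
    sst = np.sum((y_true - baseline) ** 2)      # baseline squared error
    return 1.0 - sse / sst

y_true = [0.1, 0.2, 0.3, 0.4]
r2_perfect = out_of_sample_r2(y_true, y_true)       # perfect prediction: 1.0
r2_mean = out_of_sample_r2(y_true, [0.25] * 4)      # mean predictor: approximately 0
r2_bad = out_of_sample_r2(y_true, [10.0] * 4)       # far-off predictor: negative
```

Averaging this quantity across the testing counties gives the summary statistics reported below.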
Due to the nature of neural networks and the relatively small sample size, we conduct five-fold cross-validation to evaluate the learning models. We divide all the counties into train-test splits, where the correlation matrix is re-calculated on only the training set. Then, for each K = 1, ..., 5, Algorithm 1 is executed on the training set for that particular split. Hence, we have 25 sets of results for each model (five values of K for each of the five train-test splits).

Data
Social distancing data is courtesy of Unacast and its COVID-19 Social Distancing Scoreboard. The scoreboard tracks mobile device movement and has three metrics that quantify the level of social distancing people in a particular county are practicing. The first metric is the percentage change in total distance traveled, averaged across all devices, compared to a pre-Corona baseline. The second is the percentage change in the number of visitations to non-essential places compared to a pre-Corona baseline. For these two metrics, the pre-Corona baseline of a county on a particular day is defined as the average over the four corresponding pre-Corona weekdays (at least four weeks before the day). For example, for Monday 3/30, the pre-Corona baseline of the first metric is the average of the first metric over the four Mondays 2/10, 2/17, 2/24, and 3/2. The final metric is the rate of human encounters as a fraction of the pre-Corona national baseline. The pre-Corona national baseline for this metric is the average of the metric taken over the four weeks that immediately precede the COVID-19 outbreak (02/10/2020 - 03/08/2020), as defined by Unacast. Since this data starts on 02/25/2020, after the start of the coronavirus case data (01/22/2020), we perform prediction on the period 02/25/2020 - 7/10/2020, spanning the start of the "initial phase" through the end of the "recent phase". Also note that not all counties from the Johns Hopkins CSSE data and Data Planet are available in Unacast's database, so this section is performed on a subset of the counties from Subsection 3.1.

Figure 11:
Out-of-sample R² boxplots for all counties using Model 1 (SD-LSTM), Model 2 (SD-SP), Model 3 (SD-LM), and Model 4 (DSD-LSTM) for K = 1, 2, 3, 4, 5. The results are based on l = 4 and the period 02/25/2020 - 7/10/2020.

                       In-sample                        Out-of-sample
Model           Mean      Median    Std Dev     Mean      Median    Std Dev
Model 1, K = 1
Model 1, K = 2
Model 1, K = 3
Model 1, K = 4
Model 1, K = 5
Model 2, K = 1  -2.0229   -1.8148   0.9800      -2.9663   -1.9824   7.8948
Model 2, K = 2  -1.8886   -1.7337   0.8901      -2.9613   -1.8356   8.1548
Model 2, K = 3  -1.8819   -1.6990   0.8471      -2.9393   -1.7641   8.0641
Model 2, K = 4  -1.8647   -1.7513   0.8371      -2.9220   -1.7512   8.1715
Model 2, K = 5  -1.8673   -1.7500   0.8016      -2.9281   -1.7872   8.0991
Model 3, K = 1  -0.1227   -0.0271   0.3236      -0.1458   -0.0311   0.4000
Model 3, K = 2  -0.1148   -0.0260   0.2941      -0.1586   -0.0296   0.5145
Model 3, K = 3  -0.1138   -0.0252   0.2936      -0.1563   -0.0276   0.5054
Model 3, K = 4  -0.1095   -0.0239   0.2771      -0.1659   -0.0302   0.6044
Model 3, K = 5  -0.0991   -0.0246   0.2559      -0.1542   -0.0317   0.5320
Model 4, K = 1
Model 4, K = 2
Model 4, K = 3
Model 4, K = 4
Model 4, K = 5
Model 5, K = 1  -0.1894   -0.0394   0.4866      -0.2098   -0.0430   0.4551
Model 5, K = 2  -0.1507   -0.0365   0.3679      -0.1886   -0.0411   0.4241
Model 5, K = 3  -0.1432   -0.0383   0.3538      -0.1729   -0.0415   0.3932
Model 5, K = 4  -0.1318   -0.0415   0.3072      -0.1362   -0.0413   0.3277
Model 5, K = 5  -0.1206   -0.0374   0.2633      -0.1560   -0.0462   0.3741

Table 8: 02/25/2020 - 7/10/2020 in-sample and out-of-sample R² for Model 1 (SD-LSTM), Model 2 (SD-SP), Model 3 (SD-LM), Model 4 (DSD-LSTM), and Model 5 (SD-MLP) for K = 1, 2, 3, 4, 5. The reported mean, median, and standard deviation are averaged over the folds. For K = 1, we assume that all counties belong to one group, so we take all counties in the training data to train the neural network. The results are based on l = 4 and five-fold cross-validation; four-fifths of the counties are used as training data (in-sample) and one-fifth as testing data (out-of-sample).

Results and Discussion
Among the four prediction models we implemented using the county-level social distancing measures (see Subsection 4.4), for K = 1, 2, 3, Model 4 (DSD-LSTM) slightly outperforms
            In-Sample                        Out-of-Sample
Trial       Mean     Median   Std Dev    Mean      Median   Std Dev
Trial 1     0.2349   0.4842   0.9238     0.4523    0.4956   0.3505
Trial 2     0.2764   0.5569   0.9678     0.3304    0.4995   0.7916
Trial 3     0.2921   0.5628   0.9442     0.1163    0.4980   1.4862
Trial 4     0.2900   0.5594   0.9677     0.3410    0.5454   0.7049
Trial 5     0.3404   0.5430   0.8943     -0.0455   0.4367   1.3895
Median      0.2900   0.5569   0.9442     0.3304    0.4980   0.7916

Table 9: 02/25/2020 - 7/10/2020 random-assignment in-sample and out-of-sample R² for Model 1 (SD-LSTM), K = 2. Each trial is completed by randomly assigning each test county of one of the train-test splits to either community 1 or community 2.

Model 1 due to the use of the seven additional demographic variables. Model 1 (SD-LSTM) produces the highest average and median out-of-sample R² for K = 4, 5. Models 2 (SD-SP) and 3 (SD-LM) have much poorer performance across the board, which implies that these two models are worse than a horizontal-line fit. It is also worth mentioning that the neural network correction part of Model 2 is incredibly hard to tune well enough to outperform the linear Model 3 on its own; in this case, not only did it fail to enhance Model 3's results, Model 2's correction actually worsened the model's predictive ability. Nonparametric methods other than a neural network (such as support vector regression) were also tried, all with a similarly lackluster effect, implying that boosting or enhancing the linear estimator with a nonlinear estimator is not beneficial in this case. Model 1's and Model 4's superiority suggests a nonlinear effect that the LSTM was able to extract but the linear, semi-parametric, and MLP models were not. For Models 1 and 4, stratifying the communities through our method does make a difference in-sample, since increasing K improves the models' mean and median in-sample R². However, this is not the case out-of-sample, as K = 1 (no heterogeneity) produces the best results and the out-of-sample R² continues to drop from K = 2 to 5. It is reasonable to conclude that the decrease in per-community training sample size (e.g., K = 1 uses all counties to train, while K = 5 uses on average one-fifth of that number to train each community) is hurting the model's ability to take advantage of the heterogeneity embedded in the communities.
Thus, since neural networks have an advantage in large-sample settings, the effect of the reduction in sample size for larger K outweighs the community differences captured by community detection (Algorithm 1). We also include Model 5 (an FNN with two hidden layers, each with 50% dropout) to contrast with the LSTM. Its performance is similar to Model 3 in that it is no better than a constant fit. The advantage of the LSTM is highlighted here, since its output depends on previous computations, unlike the FNN, which assumes the inputs (as well as the outputs) are independent of each other. As COVID-19 cases are sequential information, the LSTM is clearly preferable for prediction. See Table 8 for the detailed breakdown by model and by the number of clusters K. Figure 11 contains the out-of-sample R² box plots for the four models with K = 1, 2, 3, 4, 5.

Figure 12: The left panel is the average out-of-sample R² for Model 1, K = 1 and Model 4, K = 1 for l = 1, ..., 7, for one train-test split. The right panel is the average out-of-sample R² for the same models, where one social distancing feature is left out each time. Both panels are of the phase 02/25/2020 - 7/10/2020 and based on five-fold cross-validation.

To ascertain whether using information from community detection still plays a role despite K = 1 being the best setting for out-of-sample prediction, we randomly assign each testing county to an existing community instead of using the nearest-neighbor method. As shown in Table 9, after repeating this five times for Model 1, K = 2, the median in-sample R² values are much lower compared to those of the same model in Table 8 (median of 0.5569 vs 0.6372, respectively). Albeit by a smaller margin, the out-of-sample median of 0.4980 is also smaller than the 0.5437 in Table 8.
This demonstrates that community detection can still categorize the nature of different counties' growth trajectories, but this effect is likely outweighed by the diminishing sample size as K increases. Also note that before obtaining the prediction results for each algorithm, the hyperparameter of the appropriate lead was chosen by comparing the average R² values for each lead. The left panel of Figure 12 presents the median out-of-sample R² versus l = 1, ..., 7 for the two best models, Model 1, K = 1 and Model 4, K = 1, as examples. Since the out-of-sample R² plateaus after a four-day lead, we fixed l = 4, as a larger lead would decrease precision; this choice is also consistent with studies showing that the median incubation period of COVID-19 is 4-5 days (Guan et al., 2020; Lauer et al., 2020). Furthermore, anything longer than a week or so is rarely used in epidemiological and sociological studies. In addition, we quantify social distancing feature importance by averaging the out-of-sample R² when leaving each feature out one at a time. The right panel of Figure 12 suggests that although there is no distinct drop in performance, leaving out feature 1 (percent change in total distance traveled) results in the largest decline in R², whereas leaving out feature 2 (percent change in the number of visitations to non-essential places) results in the smallest decline.

By utilizing spectral clustering to recover communities, we develop a framework to detect COVID-19 communities and discover meaningful interpretations of the clusters. We use the correlation matrix instead of the canonical Laplacian, as it offers more meaningful insight and more distinct clusters. The resulting communities are distinct in the nature of their respective growth trajectories, and several demographic variables further distinguish these growth communities.
Singling out the significant demographic features that have explanatory power over a county's growth-community membership, we discover that not all of these variables are intuitive in their role in impacting COVID-19 cases. After modeling and interpreting historical disease progression, we turn to studying future growth trajectories by incorporating social distancing information. We are able to reliably predict the logarithmic trends in case growth through the use of LSTMs, and we also verify that the counties are far from homogeneous: the obtained communities contain crucial information for in-sample prediction. As for the LSTM's out-of-sample predictive power, the effect of the decline in sample size when stratifying counties into more communities dominates the heterogeneity between counties' growth curves that community detection uncovers. However, after comparing the results to randomly assigning counties to communities, our method still demonstrates that using the community detection results boosts the models' predictive performance. We do, therefore, acknowledge that there could be other latent features that we did not capture in this study and that the three social distancing metrics used here may not paint the complete picture. Furthermore, we do not address the effect of government interventions at given time points that may have altered the disease progression. These points can all be further investigated. Despite these potential shortcomings, the analysis conducted on the first phase of the disease here can also be compared to the second phase, which we are currently experiencing. As the U.S. and many other countries witness an even more extraordinary uptick in cases, we foresee several possible future applications of our study, including to other contagious disease outbreaks. Another interesting future direction is to utilize the confidence distribution framework of Xie et al.
(2011) to combine studies from independent data sources from different countries.
Acknowledgements
We would like to thank Unacast Inc. for providing us with their extensive social distancing data. The work was in part supported by NSF Grants DMS-2013789, DMS-1712591, and DMS-2034022, and NIH grant 2R01-GM072611-15.

References
Emmanuel Abbe. Community detection and stochastic block models: recent developments. The Journal of Machine Learning Research, 18(1):6446–6531, 2017.

Emmanuel Abbe, Jianqing Fan, Kaizheng Wang, and Yiqiao Zhong. Entrywise eigenvector analysis of random matrices with low expected rank. Annals of Statistics, 48(3):1452–1474, 2020.

Sivaraman Balakrishnan, Min Xu, Akshay Krishnamurthy, and Aarti Singh. Noise thresholds for spectral clustering. Advances in Neural Information Processing Systems, 24:954–962, 2011.

Rebecca A Betensky and Yang Feng. Accounting for incomplete testing in the estimation of epidemic parameters. International Journal of Epidemiology, 49(5):1419–1426, 2020.

M.R. Brito, Edgar Chavez, Adolfo Quiroz, and J.E. Yukich. Connectivity of the mutual k-nearest-neighbor graph in clustering and outlier detection. Statistics and Probability Letters, 35:33–42, 1997.

Christian Brownlees, Guðmundur Stefán Guðmundsson, and Gábor Lugosi. Community detection in partial correlation network models. Journal of Business & Economic Statistics, pages 1–11, 2020+. doi: 10.1080/07350015.2020.1798241.

Jingchun Chen and Bo Yuan. Detecting functional modules in the yeast protein–protein interaction network. Bioinformatics, 22:2283–2290, 2006.

Jennifer Beam Dowd, Liliana Andriano, David M. Brazel, Valentina Rotondi, Per Block, Xuejie Ding, Yan Liu, and Melinda C. Mills. Demographic science aids in understanding the spread and fatality rates of COVID-19. Proceedings of the National Academy of Sciences, 117(18):9696–9698, 2020. doi: 10.1073/pnas.2004911117.

Jianqing Fan, Yichao Wu, and Yang Feng. Local quasi-likelihood with a parametric guide. Annals of Statistics, 37(6):4153, 2009.

Jianqing Fan, Jianhua Guo, and Shurong Zheng. Estimating number of factors by adjusted eigenvalues thresholding. Journal of the American Statistical Association, pages 1–33, 2020.

Duccio Fanelli and Francesco Piazza. Analysis and forecast of COVID-19 spreading in China, Italy and France. Chaos, Solitons & Fractals, 134:109761, 2020. doi: 10.1016/j.chaos.2020.109761.

Donniell E. Fishkind, Daniel L. Sussman, Minh Tang, Joshua T. Vogelstein, and Carey E. Priebe. Consistent adjacency-spectral partitioning for the stochastic block model when the model parameters are unknown. SIAM Journal on Matrix Analysis and Applications, 34:23–39, 2013.

Marius Gilbert, Giulia Pullano, Francesco Pinotti, Eugenio Valdano, Chiara Poletto, Pierre-Yves Boëlle, Eric D'Ortenzio, Yazdan Yazdanpanah, Serge Paul Eholie, Mathias Altmann, Bernardo Gutierrez, Moritz U G Kraemer, and Vittoria Colizza. Preparedness and vulnerability of African countries against importations of COVID-19: a modelling study. Lancet, 395(10227):871–877, 2020. doi: 10.1016/s0140-6736(20)30411-6.

Wei-jie Guan, Zheng-yi Ni, Yu Hu, Wen-hua Liang, Chun-quan Ou, Jian-xing He, Lei Liu, Hong Shan, Chun-liang Lei, David S.C. Hui, Bin Du, Lan-juan Li, Guang Zeng, Kwok-Yung Yuen, Ru-chong Chen, Chun-li Tang, Tao Wang, Ping-yan Chen, Jie Xiang, Shi-yue Li, Jin-lin Wang, Zi-jing Liang, Yi-xiang Peng, Li Wei, Yong Liu, Ya-hua Hu, Peng Peng, Jian-ming Wang, Ji-yang Liu, Zhong Chen, Gang Li, Zhi-jian Zheng, Shao-qin Qiu, Jie Luo, Chang-jiang Ye, Shao-yong Zhu, and Nan-shan Zhong. Clinical characteristics of coronavirus disease 2019 in China. New England Journal of Medicine, 382(18):1708–1720, 2020. doi: 10.1056/NEJMoa2002032.

Paul W. Holland, Kathryn Blackmond Laskey, and Samuel Leinhardt. Stochastic blockmodels: First steps. Social Networks, 5(2):109–137, 1983. doi: 10.1016/0378-8733(83)90021-7.

Hyokyoung G Hong and Yi Li. Estimation of time-varying reproduction numbers underlying epidemiological processes: A new statistical tool for the COVID-19 pandemic. PLoS One, 15(7), 2020.

Zixin Hu, Qiyang Ge, Li Jin, and Momiao Xiong. Artificial intelligence forecasting of COVID-19 in China. 2020.

Hohjin Im, Christopher Ahn, Peiyi Wang, and Chuansheng Chen. An early examination: Psychological, health, and economic correlates and determinants of social distancing amidst COVID-19. 2020.

Jiashun Jin. Fast community detection by SCORE. Annals of Statistics, 43(1):57–89, 2015a. doi: 10.1214/14-AOS1265.

Jiashun Jin. Fast community detection by SCORE. The Annals of Statistics, 43(1):57–89, 2015b.

Adam Kucharski, Timothy Russell, Charlie Diamond, Yang Liu, John Edmunds, Sebastian Funk, Rosalind Eggo, Fiona Sun, Mark Jit, James Munday, Nicholas Davies, Amy Gimma, Kevin Zandvoort, Hamish Gibbs, Joel Hellewell, Christopher Jarvis, Samuel Clifford, Billy Quilty, Nikos Bosse, and Stefan Flasche. Early dynamics of transmission and control of COVID-19: a mathematical modelling study. The Lancet Infectious Diseases, 20, 2020. doi: 10.1016/S1473-3099(20)30144-4.

Stephen A Lauer, Kyra H Grantz, Qifang Bi, Forrest K Jones, Qulu Zheng, Hannah Meredith, Andrew S Azman, Nicholas G Reich, and Justin Lessler. The incubation period of 2019-nCoV from publicly reported confirmed cases: estimation and application. Annals of Internal Medicine, 172(9):577–582, 2020.

Jing Lei and Alessandro Rinaldo. Consistency of spectral clustering in stochastic block models. Annals of Statistics, 43(1):215–237, 2015.

Giuseppe Lippi, Camilla Mattiuzzi, Fabian Sanchis-Gomar, and Brandon M. Henry. Clinical and demographic characteristics of patients dying from COVID-19 in Italy vs China. Journal of Medical Virology, 92(10):1759–1760, 2020.

Zhihua Liu, Pierre Magal, Ousmane Seydi, and Glenn Webb. Predicting the cumulative number of cases for the COVID-19 epidemic in China from early data. Mathematical Biosciences and Engineering, 17, 2020. doi: 10.1101/2020.03.11.20034314.

Andrew Y. Ng, Michael I. Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems, pages 849–856. MIT Press, 2001.

Liangrong Peng, Wuyue Yang, Dongyan Zhang, Changjing Zhuge, and Liu Hong. Epidemic analysis of COVID-19 in China by dynamical modeling. medRxiv, 2020. doi: 10.1101/2020.02.16.20023465.

J C Rajapakse, S Gupta, and X Sui. Fitting network models for functional brain connectivity. In , pages 515–519, 2017.

Weston C. Roda, Marie B. Varughese, Donglin Han, and Michael Y. Li. Why is it difficult to accurately predict the COVID-19 epidemic? Infectious Disease Modelling, 5:271–281, 2020. doi: 10.1016/j.idm.2020.03.001.

Karl Rohe, Sourav Chatterjee, and Bin Yu. Spectral clustering and the high-dimensional stochastic blockmodel. The Annals of Statistics, 39(4):1878–1915, 2011.

David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by back-propagating errors. Nature, 323:533–536, 1986.

Jieli Shen, Regina Y Liu, and Min-ge Xie. iFusion: Individualized fusion learning. Journal of the American Statistical Association, 115(531):1251–1267, 2020.

Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000.

Unacast. Unacast social distancing dataset, 2020. URL .

U von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17:395–416, 2007.

Linda Wang and Alexander Wong. COVID-Net: A tailored deep convolutional neural network design for detection of COVID-19 cases from chest X-ray images, 2020.

Stanley Wasserman and Carolyn Anderson. Stochastic a posteriori blockmodels: Construction and assessment. Social Networks, 9(1):1–36, 1987. doi: 10.1016/0378-8733(87)90015-3.

Minge Xie, Kesar Singh, and William E Strawderman. Confidence distributions and a unifying framework for meta-analysis. Journal of the American Statistical Association, 106(493):320–333, 2011.

Zifeng Yang, Zhiqi Zeng, Ke Wang, Sook-San Wong, Wenhua Liang, Mark Zanin, Peng Liu, Xudong Cao, Zhongqiang Gao, Zhitong Mai, Jingyi Liang, Xiaoqing Liu, Shiyue Li, Yimin Li, Feng Ye, Weijie Guan, Yifan Yang, Fei Li, Shengmei Luo, Yuqi Xie, Bin Liu, Zhoulang Wang, Shaobo Zhang, Yaonan Wang, Nanshan Zhong, and Jianxing He. Modified SEIR and AI prediction of the epidemics trend of COVID-19 in China under public health interventions. Journal of Thoracic Disease, 12:165–174, 2020.

Chuansheng Zheng, Xianbo Deng, Qing Fu, Qiang Zhou, Jiapei Feng, Hui Ma, Wenyu Liu, and Xinggang Wang. Deep learning-based detection for COVID-19 from chest CT using weak label. medRxiv, 2020a. doi: 10.1101/2020.03.12.20027185.

Nanning Zheng, Shaoyi Du, Jianji Wang, He Zhang, Wenting Cui, Zijian Kang, Tao Yang, Bin Lou, Yuting Chi, Hong Long, Mei Ma, Qi Yuan, Shupei Zhang, Dong Zhang, Feng Ye, and Jingmin Xin. Predicting COVID-19 in China using hybrid AI model. IEEE Transactions on Cybernetics, 50(7):2891–2904, 2020b.

Supplementary Material: Clustering with adjacency matrices
Algorithm 4 outlines the spectral clustering procedure with adjacency matrices $A_1$ and $A_2$, and Tables 10 and 11 present Groups 1 and 2's mean, median, and standard deviation of the 17 features.

Algorithm 4: Normalized Laplacian spectral clustering

Input: Similarity matrix $S \in \mathbb{R}^{n \times n}$ and number of clusters $K$.
1. Obtain the adjacency matrix $A$.
2. Compute the normalized, symmetric Laplacian matrix $L = I - D^{-1/2} A D^{-1/2}$.
3. Compute the $K$ eigenvectors $u_1, \ldots, u_K$ of $L$ with smallest eigenvalues in absolute value, and let $\hat{U} \in \mathbb{R}^{n \times K}$ be the matrix with these eigenvectors as columns.
4. Normalize the rows of $\hat{U}$ to unit norm to obtain $\hat{U}_{\mathrm{norm}}$.
5. Cluster the rows of $\hat{U}_{\mathrm{norm}}$ with $k$-means.
Return: Partition $\hat{G}_1, \ldots, \hat{G}_K$ of the nodes.

                                          Group 1                      Group 2
Feature                                   Mean     Median   Std Dev    Mean     Median   Std Dev
Population Density                        501.873  191.700  1038.69    693.482  210.820  3575.14
Median Household Income                   56442.4  53812.0  14879.8    56872.8  54539.0  15776.4
% Poverty                                 13.8581  13.3000  5.63048    13.6742  12.6500  5.92796
% households w/ 60 y/o and older          38.7766  38.8296  6.56590    39.2932  39.0888  6.33139
% w/ low access to stores                 20.6451  20.5700  9.36471    22.0766  22.1900  10.0694
% low income w/ low access to stores      6.77756  6.14000  4.23293    7.29463  6.42000  4.78238
% households w/ low access to stores      2.42279  2.03000  1.53641    2.50819  2.22000  1.63217
25 y/o and older w/ Bachelor's /1,000     116.308  112.763  43.5637    117.340  112.172  41.5866
% White                                   78.6730  82.7486  15.2783    77.9016  81.8775  16.9468
% Black                                   11.8118  6.07323  14.0567    13.0591  6.64656  15.9400
% Asian                                   2.78995  1.41062  3.98255    2.76574  1.49532  4.19016
No. of bars                               47.7015  20.0000  90.2649    38.6136  18.0000  60.4562
No. of grocery stores                     89.3181  30.0000  211.193    73.2670  29.0000  130.915
No. of restaurants                        14.9178  9.00000  15.9277    13.2853  8.00000  14.1468
% take public transportation              0.70840  0.24056  1.68443    0.96613  0.24533  6.04509

Table 10: $A_1$ clusters' mean and median values for selected features for each community, $K = 2$. Model $A_1$ corresponds to Algorithm 4 where we use the $k$-nearest-neighbors graph ($k = 7$) as $A$.

                                          Group 1                      Group 2
Feature                                   Mean     Median   Std Dev    Mean     Median   Std Dev
Population Density                        712.520  211.600  3570.04    483.554  189.310  1076.73
Median Household Income                   57683.7  54873.0  15640.9    55628.1  53739.0  14959.5
% Poverty                                 13.3879  12.2500  5.92222    14.1455  13.4000  5.61381
% households w/ 60 y/o and older          38.7801  38.3954  6.74182    39.2953  39.3441  6.13981
% w/ low access to stores                 21.7977  21.9050  9.85552    20.9322  20.5100  9.62812
% low income w/ low access to stores      7.09803  6.40000  4.67050    6.97775  6.13000  4.37390
% households w/ low access to stores      2.40026  2.20000  1.51320    2.53181  2.11000  1.65335
25 y/o and older w/ Bachelor's /1,000     119.289  113.846  44.2816    114.350  109.250  40.6497
% White                                   78.9164  82.9167  15.8005    77.6482  81.4956  16.4603
% Black                                   12.1216  6.12919  14.6064    12.7612  6.38394  15.4733
% Asian                                   2.73678  1.51268  3.88658    2.81899  1.41062  4.28174
No. of bars                               41.2933  17.0000  79.4118    44.9514  21.0000  73.3541
No. of grocery stores                     79.5744  29.0000  170.279    82.6123  30.0000  179.320
No. of restaurants                        14.2222  8.00000  15.6239    13.9210  8.00000  14.4344
% take public transportation              0.95377  0.23521  5.91278    0.71807  0.24792  2.01788

Table 11: $A_2$ clusters' mean and median values for selected features for each community, $K = 2$. Model $A_2$ corresponds to Algorithm 4 where we use the $\epsilon$-neighborhood graph ($\epsilon = 0.$). Group 1 and 2 are the obtained partitions $\hat{G}_1$ and $\hat{G}_2$, respectively.
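Algorithm 4 can be sketched in a few lines of numpy. This is an illustrative implementation, not the authors' code: the $k$-means step uses a simple deterministic farthest-point initialization for brevity (rather than k-means++), and the adjacency construction mirrors the $k$-nearest-neighbors graph referenced in Table 10:

```python
import numpy as np

def _kmeans(X, K, iters=100):
    """Tiny k-means with deterministic farthest-point initialization."""
    centers = [X[0]]
    for _ in range(1, K):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d2)])          # farthest point from chosen centers
    centers = np.array(centers)
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for k in range(K):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    return labels

def spectral_clustering(S, K, k_nn=7):
    """Normalized Laplacian spectral clustering (sketch of Algorithm 4)."""
    n = S.shape[0]
    # Step 1: adjacency matrix A from the k-nearest-neighbor graph of S.
    A = np.zeros((n, n))
    for i in range(n):
        order = np.argsort(S[i])[::-1]            # most similar nodes first
        A[i, order[order != i][:k_nn]] = 1.0
    A = np.maximum(A, A.T)                        # symmetrize (union of neighborhoods)
    # Step 2: normalized symmetric Laplacian L = I - D^{-1/2} A D^{-1/2}.
    d_is = 1.0 / np.sqrt(np.maximum(A.sum(axis=1), 1e-12))
    L = np.eye(n) - (d_is[:, None] * A) * d_is[None, :]
    # Step 3: eigenvectors for the K smallest eigenvalues (eigh sorts ascending).
    _, vecs = np.linalg.eigh(L)
    U = vecs[:, :K]
    # Step 4: normalize rows of U to unit norm.
    U = U / np.maximum(np.linalg.norm(U, axis=1, keepdims=True), 1e-12)
    # Step 5: cluster the rows with k-means.
    return _kmeans(U, K)

# Toy check: 20 points in two well-separated groups, Gaussian-kernel similarity.
rng = np.random.default_rng(1)
pts = np.vstack([rng.normal(0, 0.1, (10, 2)), rng.normal(5, 0.1, (10, 2))])
S = np.exp(-((pts[:, None] - pts[None]) ** 2).sum(-1))
labels = spectral_clustering(S, K=2, k_nn=5)
```

On the toy data the two planted groups are recovered exactly; in the paper, S would be the correlation matrix of county growth trajectories and the labels the growth communities.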