Analyzing the State of COVID-19: Real-time Visual Data Analysis, Short-Term Forecasting, and Risk Factor Identification
COVID-19 Real-Time Tracker and Analytical Report
Jiawei Long
Department of Biostatistics
UCLA Fielding School of Public Health
University of California, Los Angeles
Email: [email protected]
Abstract - While the COVID-19 outbreak was first reported in Wuhan, China, it was declared a Public Health Emergency of International Concern (PHEIC) on 30 January 2020 by the WHO, and it had spread to over 180 countries by the time this paper was composed. As the disease spreads around the globe, it has evolved into a worldwide pandemic, endangering the state of global public health and becoming a serious threat to the global community. To combat and prevent the spread of the disease, all individuals should be well-informed of the rapidly changing state of COVID-19. In the endeavor of accomplishing this objective, a COVID-19 real-time analytical tracker has been built to provide the latest status of the disease and relevant analytical insights. The real-time tracker is designed to cater to the general audience without advanced statistical aptitude. It aims to communicate insights through straightforward and concise data visualizations that are supported by sound statistical foundations and reliable data sources. This paper discusses the major methodologies used to generate the insights displayed on the real-time tracker, which include real-time data retrieval, normalization techniques, ARIMA time-series forecasting, and logistic regression models. In addition to introducing the details and motivations of these methodologies, the paper also features some key discoveries about COVID-19 that have been derived using them.
Index Terms - COVID-19, Real-Time Tracker, Common Symptoms, Data Visualization, Hypothesis Testing, ARIMA Time-Series Forecast, Penalized Logistic Regression
1. INTRODUCTION
The COVID-19 real-time tracker primarily includes features such as odometers of the latest status of COVID-19 cases, trend analysis and prediction of COVID-19 cases in 185 different countries, informative visualizations of the most common symptoms and risk factors, and patient demographic distributions. Subsequent sections provide a brief description of every major feature, discuss the relevant methodologies behind the feature, and highlight selected findings from the feature.

Link to the COVID-19 real-time tracker: https://peterljw.shinyapps.io/covid_dashboard/
2. FEATURE: OVERVIEW
This section of the COVID-19 real-time tracker contains two different pages to separately highlight the most current states of COVID-19 in the states within the U.S. and in countries around the globe (see Figure 1 and Figure 2). The two pages share the same features and elements. The top of the page has three odometer boxes to display the total confirmed cases, total deaths, and total recovered cases along with their respective daily new counts. The bottom half of the page contains a user-interactive control panel and a display window. Users are able to apply population normalization or log transformation to the visualizations in the display window through the widgets in the control panel.

Figure 1. U.S. Overview
Figure 2. World Overview

The display window has the viewing options of heat map visualizations (Figure 3), time-series line plots (Figure 4), or data tables (Figure 5). All visualizations have interactive features such as tooltips and zooming, and all tables can be interactively sorted by clicking on column names. The heat map visualization allows users to quickly assess the severity of COVID-19 in different geographical locations, while the time-series line plot shows a comparison of the most affected regions on a standardized time scale. The data table provides users with the flexibility to explore and search for data of their interest.

Figure 3. World Confirmed Cases Heat Map
Figure 4. World Confirmed Cases Time-Series Line Plot
Figure 5. World Data Table

The purpose of this feature is to provide the audience with an aggregated view of the severity of COVID-19 in different locations and inform the audience of the latest status of the disease at first glance. The options of applying log transformation and population normalization allow the audience to observe the state of COVID-19 from different perspectives, while the interactive table allows the audience to explore specific metrics of their interest.
To ensure the accuracy and reliability of the tracker's content, the website's server retrieves the newest data from the COVID-19 data repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University whenever a user loads the web page. The data repository is maintained by the Johns Hopkins University CSSE and supported by the ESRI Living Atlas Team and the Johns Hopkins University Applied Physics Lab. According to the documentation of the repository, the data source is updated numerous times throughout the day, and the validity of the data is verified by researchers at Johns Hopkins University. The content displayed in the overview feature is therefore derived from a real-time and reliable data source.

To achieve real-time data retrieval, the web server contains a protocol to download and ingest the data source from the CSSE data repository whenever a request is sent to the server, i.e. when a user accesses the web page in a browser. When the server receives such a request, it attempts to download the data by getting the current date and accessing the data source with an updated URL. Once the data file is downloaded successfully, it is ingested and stored temporarily on the server. Subsequently, a pre-specified R script reads the data and preprocesses it into different data frames to support the insights derived on the web page. If the download is unsuccessful due to unforeseen circumstances, the web server loads the most recent data file that it has previously ingested to support the content on the web page. The server also logs such errors so that they can be handled to improve the robustness of the tracker. Figure 6 provides a visual summary of the real-time data retrieval process.

Figure 6. Real-Time Data Retrieval Visualization
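The retrieval protocol described above (build a dated URL, download, cache, fall back to the last good copy on failure) can be sketched as follows. This is an illustrative Python sketch, not the tracker's actual R implementation; the function names are our own, while the MM-DD-YYYY.csv naming follows the JHU CSSE daily-report convention.

```python
from datetime import date, timedelta
import urllib.request
import urllib.error

# Base path of the JHU CSSE daily-report CSVs (files are named MM-DD-YYYY.csv).
BASE = ("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/"
        "csse_covid_19_data/csse_covid_19_daily_reports/")

def report_url(d: date) -> str:
    """Build the daily-report URL for a given date."""
    return f"{BASE}{d.strftime('%m-%d-%Y')}.csv"

def fetch_latest(cache_path="latest.csv", max_back=3):
    """Try today's file, then step back one day at a time (the report for
    the current date may not be published yet); on total failure, fall
    back to the locally cached copy from the last successful download."""
    d = date.today()
    for _ in range(max_back):
        try:
            with urllib.request.urlopen(report_url(d), timeout=10) as resp:
                data = resp.read()
            with open(cache_path, "wb") as f:
                f.write(data)          # refresh the local cache
            return data
        except urllib.error.URLError:
            d -= timedelta(days=1)
    with open(cache_path, "rb") as f:  # fall back to last good download
        return f.read()
```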
The control panel on the page allows users to apply log transformation and population normalization (i.e. cases per million) to the data, which interacts with the corresponding heat map and time-series line plot visualizations. When the user turns the log scale switch on, the logarithmic function with a base of 10 is applied as a deterministic mathematical function to each point in the data set. That is, every data point x_i is replaced by y_i = log_10(x_i). Such a transformation significantly improves the interpretability and appearance of the visualizations. The choice of the logarithmic function is based on the nature of exponential growth associated with a pandemic and the relatively large differences in the raw counts of cases across different locations in the later stages of a pandemic. The effect of log transformation is demonstrated in Figure 7.

Figure 7. Before (Left) and After (Right) Log Transformation

On the other hand, while population normalization does not necessarily improve the appearance of the visualizations, it alters the interpretation of the visualization by accounting for the population of each region. Such a perspective is beneficial because each country or state can vary significantly in its population. Assessing the number of cases per million provides a more robust estimation of the severity of COVID-19 in each region than solely observing the total counts. To achieve population normalization, global country-level population data and state-level population data of the U.S. are preprocessed and stored on the server, and they are joined to the retrieved data to produce the corresponding visualizations. Precisely, the normalization is applied in the following manner: every data point x_i is replaced by y_i = x_i / p_j, where p_j denotes the population (in millions) of region j. The effect of population normalization is demonstrated in Figure 8.

Figure 8.
Before (Left) and After (Right) Population Normalization

In addition, the time-series line plots have built-in timescale standardization. Rather than comparing the time-series data with respect to date, the plot compares them with respect to the number of days after the spread of the disease reaches a certain magnitude. Since the time frames of outbreaks are different in every region, it would be hard to compare the severity of the disease in each region on separate time frames. Hence, timescale standardization places the time-series data onto a universal time scale. In conjunction with population normalization, the audience is able to compare the regions which have the fastest spread of COVID-19.

There are a number of noticeable findings which surface from the visualizations and data presented in the overview feature. In spite of the fact that the U.S. has the most confirmed cases of COVID-19 around the globe and accounts for approximately 30% of the global total confirmed cases, it no longer tops the chart with the application of population normalization. Among countries with over a million population, the most severely affected countries are Qatar, Singapore, Bahrain, Spain, and Ireland. Similarly, if we sort the world data table by confirmed cases per million, we can see the U.S. is not in the top 10 countries. As shown, there are numerous smaller countries having tougher struggles with COVID-19, as they have higher confirmed cases per million and tend to have less advanced medical supplies to combat the disease. Such countries gain much less exposure and discussion in mainstream news coverage due to their limited presence in the global economy, but they may be suffering from much greater severity of COVID-19. Similarly, within the United States, the severity of COVID-19 has been quietly climbing in some smaller states.
For example, Rhode Island, Connecticut, Delaware, Louisiana, and Nebraska are among the top 10 states with the highest confirmed cases per million. While California has the fourth highest confirmed number of cases in the United States and has received a relatively high amount of media coverage, it stands only at the 32nd position among all the states when measuring confirmed cases per million. While it is true that regions with higher populations are more vulnerable to bigger outbreaks, regions with small populations cannot be overlooked and also deserve attention.
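The two transformations offered by the control panel can be sketched in a few lines. This is an illustrative Python sketch (the tracker itself is implemented in R/Shiny); the function names are our own.

```python
import math

def log10_scale(counts):
    """Base-10 log transform: y_i = log10(x_i). Zero counts are left at
    0.0 so that regions with no cases can still be plotted."""
    return [math.log10(x) if x > 0 else 0.0 for x in counts]

def per_million(counts, populations):
    """Population normalization: cases per million residents,
    y_i = x_i / (p_j / 1e6), where p_j is the raw population count."""
    return [c / (p / 1_000_000) for c, p in zip(counts, populations)]
```

For example, 5,000 cases in a region of 10 million residents normalizes to 500 cases per million.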
3. FEATURE: TREND BY COUNTRY
The control panel on the page allows users to specify the period of days which the moving-average aggregation uses to draw the moving-average curve. A moving average is an aggregating calculation that analyzes data points by creating a series of averages of different subsets of the full data set. In this case, the simple moving average is used to compute the values of the moving averages. To calculate a simple moving average, let X_t be the number of new cases at time t; then the simple moving average over the previous n periods at time m is computed as

SMA_m = (X_m + X_{m-1} + ... + X_{m-(n-1)}) / n = (1/n) Σ_{i=0}^{n-1} X_{m-i}.
By computing a series of simple moving averages, we are able to smooth out short-term fluctuations in the number of daily new cases and highlight longer-term trends or cycles. This is especially useful in determining the constantly changing state of the COVID-19 outbreak in a particular region.
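The moving-average smoothing above can be sketched as follows (an illustrative Python sketch of the generic formula; the tracker computes this in R):

```python
def simple_moving_average(x, n):
    """n-period simple moving average: SMA_m = (1/n) * (X_m + ... +
    X_{m-(n-1)}). Returns one value per position from index n-1 onward,
    so the output is shorter than the input by n-1 points."""
    return [sum(x[m - n + 1 : m + 1]) / n for m in range(n - 1, len(x))]
```

For daily new-case counts [1, 2, 3, 4, 5], a 3-day window yields the smoothed series [2.0, 3.0, 4.0].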
An auto ARIMA model (i.e. the auto.arima() function in the forecast R package) is used to implement a 5-day time-series prediction on the number of daily new cases. For every given time series, the script automatically fits the best ARIMA model to the data by using the AIC, AICc, or BIC score as the basis of judgment. The following discussion briefly introduces the notion of the ARIMA model and provides the necessary background information for the subsequent discussion of the prediction mechanism.

Suppose that X_1, ..., X_t is a time series, where the X_i are real numbers and t is an integer. An ARMA(p', q) model for the series can be written as

X_t - α_1 X_{t-1} - ... - α_{p'} X_{t-p'} = ε_t + θ_1 ε_{t-1} + ... + θ_q ε_{t-q},

or equivalently, using the lag operator L,

(1 - Σ_{i=1}^{p'} α_i L^i) X_t = (1 + Σ_{i=1}^{q} θ_i L^i) ε_t,

where L represents the lag operator, the ε_t represent the error terms, the θ_i represent the moving-average parameters, and the α_i represent the autoregressive parameters. The general assumptions about the error terms are that they are:
i. Sampled from a normal distribution with zero mean;
ii. Identically distributed;
iii. Independent.

Assuming that the autoregressive polynomial (1 - Σ_{i=1}^{p'} α_i L^i) has a unit root, i.e. a factor (1 - L) of multiplicity d, it can be factored as

(1 - Σ_{i=1}^{p'} α_i L^i) = (1 - Σ_{i=1}^{p} φ_i L^i)(1 - L)^d, with p = p' - d.

An ARIMA(p, d, q) process expresses this factorization property and is given by

(1 - Σ_{i=1}^{p} φ_i L^i)(1 - L)^d X_t = (1 + Σ_{i=1}^{q} θ_i L^i) ε_t.

One can therefore think of an ARIMA(p, d, q) process as an ARMA(p + d, q) process whose autoregressive polynomial has d unit roots, which indicates that no wide-sense stationary ARMA model exists when d > 0. Generalizing the above with a drift term δ, the ARIMA(p, d, q) process becomes

(1 - Σ_{i=1}^{p} φ_i L^i)(1 - L)^d X_t = δ + (1 + Σ_{i=1}^{q} θ_i L^i) ε_t,

and the drift of the process is δ / (1 - Σ_{i=1}^{p} φ_i).

There are three major parameters (p, d, q) of the ARIMA model: p represents the lag order, or the number of lag observations included in the model; d represents the degree of differencing, or the number of times that the raw observations are differenced; and q refers to the order of the moving average, or the size of the moving-average window.

In addition, the ARIMA model assumes that the input time-series data is univariate and stationary. Stationarity implies that the properties of the time series are independent of the time at which they were captured; in other words, the data has a constant mean and variance. If not, the data needs to be transformed (differenced) before the ARIMA model can be used. The auto.arima() function automatically determines the appropriate order of differencing using the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test.

As mentioned earlier, auto.arima() performs model selection based on the AIC, AICc, or BIC score. AIC is an abbreviation of the Akaike Information Criterion, and AICc refers to the corrected AIC. BIC is the acronym of the Bayesian Information Criterion, also known as the Schwarz Information Criterion. All of them are widely used model evaluation metrics.

AIC provides a means for model selection because it estimates the quality of each model, from a collection of candidate models, relative to each of the other models. In other words, it serves as an estimator of out-of-sample prediction error, thus estimating the relative quality of statistical models for a given data set. Given a particular statistical model with k being the number of estimated parameters and L the maximized value of the likelihood function, the model's AIC value is given by

AIC = 2k - 2 ln(L).

Thus, as assessed by the likelihood function, AIC rewards goodness of fit while penalizing the number of parameters. On the other hand, BIC is closely related to AIC because it is also partly based on the likelihood function. BIC is likewise a criterion for model selection among a finite set of models, and the most preferred model is the one with the lowest BIC. The formal definition of BIC is

BIC = k ln(n) - 2 ln(L).

In both definitions, k is the number of parameters estimated by the model, n is the sample size (the number of observations or data points), and L is the maximized value of the model's likelihood function. For ARIMA models, the evaluation metrics can be computed as follows,

AIC = -2 log(L) + 2(p + q + k + 1),
AICc = AIC + 2(p + q + k + 1)(p + q + k + 2) / (n - p - q - k - 2),
BIC = AIC + (log(n) - 2)(p + q + k + 1),

where k = 1 if the ARIMA model has an intercept and k = 0 otherwise, q is the order of the moving-average part, p is the order of the autoregressive part, and L is the maximized value of the model's likelihood function.

For every given time series, auto.arima() chooses the parameters which give the lowest AIC, AICc, or BIC, and forecasts the values for the next 5 days. As part of the output from the auto.arima() function, 95% prediction intervals are taken to plot the transparent orange ribbon around the mean prediction. The 95% interval for the h-step-ahead ARIMA forecast ŷ_{T+h} is computed as ŷ_{T+h} ± 1.96 σ̂_h, where σ̂_h² is the estimated variance of the h-step forecast error.

From this feature, we can observe various trends and patterns as countries around the globe react differently to mitigate the spread of COVID-19. For example, European countries such as Spain, Italy, and Germany have implemented relatively strict country-level lockdown policies, and this is reflected in their trend plots (Figure 9).

Figure 9.
Trends in Spain, Italy, and Germany

Iran also implemented similar lockdown policies, and the country began to reopen throughout April after seeing a significant drop in its number of new cases. However, the country was hit by a new surge of COVID-19 cases in May (Figure 10).

Figure 10. Trend in Iran

In contrast, the U.S.'s reaction at the federal level has been relatively slow-moving and incoherent. Although the number of new cases has been gradually decreasing, the rate of decrease is relatively small compared to countries that implemented strict lockdown policies. We may expect a similar new surge of cases in the U.S. if the country were to reopen without caution in the near future.

Figure 11. Trend in U.S.
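The generic information criteria used by auto.arima() for model selection can be computed directly from the maximized log-likelihood. A minimal illustrative Python sketch of the definitions given above (the function names are our own; auto.arima() performs these calculations internally in R):

```python
import math

def aic(log_lik, k):
    """Akaike Information Criterion: AIC = 2k - 2 ln(L)."""
    return 2 * k - 2 * log_lik

def aicc(log_lik, k, n):
    """Small-sample corrected AIC: AIC + 2k(k+1)/(n - k - 1)."""
    return aic(log_lik, k) + (2 * k * (k + 1)) / (n - k - 1)

def bic(log_lik, k, n):
    """Bayesian Information Criterion: BIC = k ln(n) - 2 ln(L)."""
    return k * math.log(n) - 2 * log_lik
```

Given candidate models, the one with the lowest criterion value is preferred; note that BIC penalizes extra parameters more heavily than AIC once n > e² ≈ 7.4.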
4. FEATURE: COMMON SYMPTOMS
This section of the COVID-19 real-time tracker contains an interactive visual summary of the most common symptoms associated with the disease (Figure 12). Due to data quality issues and the uncertain nature of the disease, it is difficult to estimate the true prevalence of the symptoms among infected patients. Hence, their prevalence measure is standardized to a 0-to-10 scale, represented by the horizontal axis of the plot.

Figure 12. Common Symptoms of COVID-19
Since the symptom variable in the patient-level data contains descriptive sentences of a patient's symptoms (e.g. "Moderate fever 38.5°C, cough, strong headache"), we have to apply natural language processing techniques such as n-gram tokenization to transform and preprocess the data. The goal is to convert the descriptive sentences into a set of binary indicator variables, as shown by the simple example in Figure 13.

Figure 13. Example of Converting Sentences to Binary Indicators

Word tokenization refers to splitting a sample of text into words or phrases, and n-gram tokenization refers to tokenization that splits the text into phrases which contain n words. For example, unigram tokenization will turn the sentence "he has shortness of breath" into [he, has, shortness, of, breath], while trigram tokenization will turn the sentence into [he has shortness, has shortness of, shortness of breath].

As an attempt to collect all of the recorded symptoms in the dataset, we can apply n-gram tokenization to every descriptive sentence and compute the frequency of each token for n = {1, 2, 3, 4}. As anticipated, we can obtain a list of the most common symptoms from the symptom records by looking through the processed output of the n-gram tokenization (Figure 14).

Figure 14. Examples of N-Gram Output

After obtaining a comprehensive list of symptoms, we can then create a dictionary of phrases for each symptom and loop through all descriptive sentences to see if they contain any phrase in any dictionary. For example, the dictionary for cough is [cough, coughing], and any sentence that contains cough or coughing will take the value of 1 for cough's binary indicator variable and 0 otherwise.
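The tokenization and dictionary-matching steps just described can be sketched as follows. This is an illustrative Python sketch; the symptom dictionaries shown are abbreviated examples, not the tracker's full lists.

```python
def ngrams(sentence, n):
    """Split a sentence into n-word tokens (n-grams), after lowercasing
    and stripping commas."""
    words = sentence.lower().replace(",", " ").split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

# Hypothetical (abbreviated) symptom dictionaries: each symptom maps to
# the phrases that indicate its presence in a descriptive sentence.
DICTIONARIES = {
    "fever": ["fever"],
    "cough": ["cough", "coughing"],
    "shortness of breath": ["shortness of breath"],
}

def to_indicators(sentence):
    """Convert one descriptive sentence into binary symptom indicators:
    1 if any phrase in a symptom's dictionary appears among the
    sentence's n-grams, 0 otherwise."""
    tokens = set()
    for n in (1, 2, 3):
        tokens.update(ngrams(sentence, n))
    return {symptom: int(any(p in tokens for p in phrases))
            for symptom, phrases in DICTIONARIES.items()}
```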
By the end of the loop, we have finished converting the descriptive sentences into a set of binary indicator variables in the format shown in Figure 13. After the application of n-gram tokenization to create all the necessary binary indicator variables, we can obtain the aggregated count of patients for every symptom by calculating the columnar sums of the binary indicator variables. To better communicate the level of prevalence of each symptom, we can apply min-max normalization to the columnar sums to standardize each data point onto a scale of 0 to 10. For any symptom's columnar sum, S_i, its scaled value is computed as follows,

S_{i,scaled} = 10 × (S_i - S_min) / (S_max - S_min).
The scaled value is an abstract representation of the symptom's prevalence relative to other symptoms, and it does not reflect the true prevalence of the symptom among patients who have been infected with COVID-19.
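The min-max scaling onto a 0-to-10 range can be sketched in a couple of lines (illustrative Python; the function name is our own):

```python
def minmax_scale_0_10(sums):
    """Scale columnar sums onto [0, 10]:
    S_scaled = 10 * (S_i - S_min) / (S_max - S_min)."""
    lo, hi = min(sums), max(sums)
    return [10 * (s - lo) / (hi - lo) for s in sums]
```

For example, symptom counts [2, 6, 10] scale to [0.0, 5.0, 10.0], so the least common symptom anchors the left end of the plot and the most common anchors the right.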
A logistic regression model is built to identify risk factors that could potentially increase a patient's likelihood of dying from COVID-19. Once we have formed all the binary indicator variables for symptoms, we can use them along with other variables as predictors to build a logistic regression model on a patient's outcome, which is either active/recovered or death. The following discussion briefly introduces the notion of the logistic regression model and provides the necessary background information for the subsequent discussion of hypothesis testing of the model's coefficients.

Let y be a binary output variable taking values in {0, 1}, analogous to a patient's outcome, and suppose we would like to model the output y as a function of the input variables x = (x_1, ..., x_p). To represent E(y | x) so that its value lies in (0, 1), we can apply the sigmoid function as follows,

P(y = 1 | x, β) = e^{β^T x} / (1 + e^{β^T x}),
P(y = 0 | x, β) = 1 / (1 + e^{β^T x}).

We can invert the transformation above to obtain the logit function,

g(x | β) = log( P(y = 1 | x, β) / (1 - P(y = 1 | x, β)) ) = β^T x.

Suppose we are fitting a logistic regression model to a dataset of n observations, D = {(x_1, y_1), ..., (x_n, y_n)}. We can express the conditional likelihood of a single observation as

P(y_i | x_i, β) = P(y_i = 1 | x_i, β)^{y_i} P(y_i = 0 | x_i, β)^{1 - y_i},

so the conditional log-likelihood of the full dataset is

l(β | X, Y) = Σ_{i=1}^{n} [ y_i log P(y_i = 1 | x_i, β) + (1 - y_i) log P(y_i = 0 | x_i, β) ].

To find the maximum likelihood estimators of β, we take the gradient of the expression above and set it equal to 0. Differentiating with respect to β_k gives

∂l(β | X, Y) / ∂β_k = Σ_{i=1}^{n} ( y_i - P(y_i = 1 | x_i, β) ) x_{ik}.

Because of the nonlinear dependence on the parameters, there is no closed-form solution to these equations, and they must be solved iteratively using numerical methods such as the Newton-Raphson method. The details of the method will not be discussed in this report, as we will focus on the problem of testing hypotheses about the coefficients.
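The gradient expression above can be used directly in a simple iterative fit. The following is a minimal illustrative Python sketch using gradient ascent on toy data, not the tracker's glmnet-based implementation (which uses more sophisticated solvers such as Newton-Raphson/coordinate descent):

```python
import math

def sigmoid(z):
    """P(y = 1) under the logistic model for linear predictor z."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.1, iters=2000):
    """Maximize the log-likelihood by gradient ascent, using the update
    beta_k += lr * (y_i - P(y_i = 1 | x_i)) * x_ik for each observation.
    Each row of X should include a leading 1 for the intercept."""
    beta = [0.0] * len(X[0])
    for _ in range(iters):
        for i, xi in enumerate(X):
            p = sigmoid(sum(b * v for b, v in zip(beta, xi)))
            for k, v in enumerate(xi):
                beta[k] += lr * (y[i] - p) * v
    return beta
```

On a toy dataset where the outcome flips as the single predictor grows, the fitted model assigns a low death probability at small predictor values and a high probability at large ones.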
Suppose we have successfully estimated all the coefficients, β̂, using numerical methods; we can then use hypothesis testing to evaluate whether the predictors have statistically significant associations with the output variable. Based on the large-sample distribution of the maximum likelihood estimator, we can apply the Wald test for this problem. For any coefficient β_j, we have the following hypothesis testing setup,

H_0: β_j = 0 vs. H_1: β_j ≠ 0.

To assess the significance of the coefficient, we calculate the ratio of the estimate to its standard error as follows,

z = β̂_j / SE(β̂_j) ~ N(0, 1),

where SE(β̂_j) is calculated by taking the inverse of the estimated information matrix.

Going back to our case and applying logistic regression to the COVID-19 patient dataset, we have the following logit function,

logit(P(death)) = β_0 + β_1(age) + β_2(sex: female) + β_3(chronic disease) + β_4(respiratory distress syndrome) + β_5(respiratory failure) + β_6(chest distress) + β_7(shortness of breath) + β_8(heart failure) + β_9(runny nose) + β_10(septic shock) + β_11(sore throat) + β_12(anorexia) + β_13(arrhythmia) + β_14(cough) + β_15(diarrhea) + β_16(dizziness) + β_17(fatigue) + β_18(fever) + β_19(headache) + β_20(infarction) + β_21(malaise) + β_22(myalgia) + β_23(phlegm) + β_24(pneumonia) + β_25(sepsis) + β_26(soreness).

Before we interpret the model, we first need to ensure its quality. To evaluate the quality and accuracy of the logistic regression model, we can use k-fold cross-validation. The goal of cross-validation is to evaluate the model's ability to generalize to and predict unseen data, in order to flag potential problems such as overfitting or selection bias. It provides insight into the model's robustness and generalization on new data that is not part of its training data. One round of cross-validation involves partitioning a sample of data into complementary subsets, performing the analysis on one subset (i.e. the training set) and validating the analysis on the other subset (i.e. the validation or testing set). To reduce variability, we can repeat this procedure k times by initially partitioning the data into k subsets. Figure 15 demonstrates a visual summary of the process when k = 5.

Figure 15. 5-Fold Cross Validation

Using the caret and glmnet packages in R, we are able to perform a 5-fold cross-validation to compute the overall accuracy and the ROC curve of the logistic regression model.
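The k-fold partitioning scheme described above can be illustrated in a few lines. This is a pure-Python sketch of the index bookkeeping that caret handles internally; the function name is our own.

```python
import random

def kfold_indices(n, k=5, seed=42):
    """Partition indices 0..n-1 into k disjoint folds. Each fold serves
    once as the validation set while the remaining k-1 folds form the
    training set, mirroring the 5-fold scheme in Figure 15."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)        # shuffle for random folds
    folds = [idx[i::k] for i in range(k)]   # k near-equal folds
    return [(sorted(set(idx) - set(fold)), sorted(fold)) for fold in folds]
```

Each of the k (train, validation) pairs is then fitted and scored, and the k scores are averaged to estimate out-of-sample accuracy.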
After repeating the same procedures for logistic regression with L1 and L2 regularization, it was found that the regular logistic regression had the best performance. According to the results, the model has an overall accuracy of 0.900 with a standard deviation of 0.030. The ROC curve is shown in Figure 16.

Figure 16. ROC Curve

After confirming the quality of the model, we can apply the same model to the whole dataset and interpret the coefficient table from the output. The output coefficient table is displayed in the table below.
From this section, we can derive that fever and cough are the two major symptoms associated with COVID-19. Moreover, pneumonia and shortness of breath are discovered to be significant risk factors, as they potentially increase one's likelihood of dying from the disease, controlling for other factors.
5. PATIENT DEMOGRAPHICS
This section of the COVID-19 real-time tracker shows a summary visualization of the distributions of demographic characteristics of selected patients (Figure 17).

Figure 17. Summary Visualizations of Demographic Characteristics of Selected Patients
To determine whether there is a statistically significant difference in the age of the two patient groups, active/recovered versus death, we can conduct a two-sample t-test. The two-sample t-test is a hypothesis testing method to compare two continuous-data distributions; more precisely, it tests whether the means of two continuous-data distributions are equal. There are a number of assumptions that need to be satisfied in order to use the two-sample t-test properly, and they are listed as follows,
i. The data are continuous (not discrete);
ii. The data follow the normal probability distribution;
iii. The variances of the two populations are equal (if not, un-pooled variances are used to calculate the test statistic);
iv. The two samples are independent: there is no relationship between the individuals in one sample as compared to the other;
v. Both samples are simple random samples from their respective populations, i.e. each individual in the population has an equal probability of being selected in the sample.

Assumption (i) is satisfied as the value of age is continuous. However, assumptions (iv) and (v) may not be valid due to potential data quality issues such as missing data. We will presume they are satisfied and proceed with caution. For assumption (ii), we can validate the data's normality using QQ plots as follows,

Figure 18. QQ Plots of Ages of Active/Recovered Patients (Left) and Dead Patients (Right)

The data points appear to be reasonably consistent with the quantiles of a normal distribution. For assumption (iii), we can apply the F-test of equality of variances as follows,

H_0: σ_X² = σ_Y² vs. H_1: σ_X² ≠ σ_Y², with F = S_X² / S_Y² ~ F(n - 1, m - 1),

where S_X denotes the sample standard deviation of the age of active/recovered patients, S_Y denotes the sample standard deviation of the age of dead patients, and n and m denote the sample sizes of the two groups. After conducting the hypothesis test, we obtained a p-value of 0.00097. Thus, we have sufficient evidence to reject the null hypothesis and conclude that the variances of the two groups are unequal at the alpha level of 0.05.

Consequently, we proceed to conduct a two-sample t-test with un-pooled variances (Welch's t-test). The setup is demonstrated below,

H_0: μ_X = μ_Y vs. H_1: μ_X ≠ μ_Y, with t = (X̄ - Ȳ) / sqrt(S_X²/n + S_Y²/m) ~ t(v),

where the degrees of freedom v are given by the Welch-Satterthwaite approximation,

v = (S_X²/n + S_Y²/m)² / [ (S_X²/n)² / (n - 1) + (S_Y²/m)² / (m - 1) ].
As a result, we obtained a p-value that is approximately 0, which allows us to reject the null hypothesis at the alpha level of 0.05. Hence, we have sufficient evidence to conclude that the average age of patients who are active or recovered is different from the average age of patients who have died from COVID-19.
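The Welch statistic and its degrees of freedom can be computed directly from the two groups' summary statistics; a minimal illustrative Python sketch (the function name is our own):

```python
import math

def welch_t(mean_x, var_x, n, mean_y, var_y, m):
    """Welch's two-sample t statistic and Welch-Satterthwaite degrees of
    freedom: t = (xbar - ybar) / sqrt(sx^2/n + sy^2/m)."""
    se2 = var_x / n + var_y / m                       # squared standard error
    t = (mean_x - mean_y) / math.sqrt(se2)
    df = se2 ** 2 / ((var_x / n) ** 2 / (n - 1)
                     + (var_y / m) ** 2 / (m - 1))    # Welch-Satterthwaite
    return t, df
```

As a sanity check, with equal variances and equal sample sizes the Welch degrees of freedom reduce to the pooled value n + m - 2.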
To determine whether there is a statistically significant association between a patient's gender and a patient's outcome, we can conduct a Chi-Square test of association on a 2x2 contingency table. There are a number of assumptions that need to be satisfied in order to use the Chi-Square test properly, and they are listed as follows,
i. The data in the cells should be frequencies, or counts of cases;
ii. The levels (or categories) of the variables are mutually exclusive;
iii. Each observation is independent of all the others;
iv. The expected values should be 5 or more in at least 80% of the cells, and no cell should have an expected value of less than one.

Assumptions (i) and (ii) are met since we are observing counts of patients who are either male or female, and either active/recovered or deceased. In addition, assumption (iv) is satisfied as shown by the 2x2 contingency table below. We will presume assumption (iii) to hold true and proceed.

After filtering the data to create a subset of patient data with recorded genders and outcomes, we can form the following 2x2 contingency table,

         Active/Recovered   Death
Male     299                132
Female   213                71

We can calculate the Chi-Square test statistic as follows,

X² = Σ_{i=1}^{r} Σ_{j=1}^{c} (O_{i,j} - E_{i,j})² / E_{i,j} ~ χ²((r - 1)(c - 1)),

where E_{i,j} = (R_i × C_j) / n, with R_i the total of row i, C_j the total of column j, and n the overall sample size. As a result, we obtained a p-value of 0.1025, which does not provide sufficient evidence for rejecting the null hypothesis. Thus, we fail to reject the null hypothesis at the alpha level of 0.05 and conclude that a patient's gender has no statistically significant association with the patient's outcome.
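The Chi-Square calculation on the contingency table above can be reproduced in a few lines of illustrative Python (the function name is our own; for 1 degree of freedom, the chi-square survival function reduces to the complementary error function):

```python
import math

def chi_square_2x2(table):
    """Pearson chi-square statistic for a 2x2 contingency table, with
    expected counts E_ij = (row total * column total) / n, and its
    p-value on 1 degree of freedom."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    stat = sum((table[i][j] - rows[i] * cols[j] / n) ** 2
               / (rows[i] * cols[j] / n)
               for i in range(2) for j in range(2))
    p = math.erfc(math.sqrt(stat / 2))  # chi-square(1) survival function
    return stat, p

# The gender-by-outcome table from the text.
stat, p = chi_square_2x2([[299, 132], [213, 71]])
```

Running this on the table [[299, 132], [213, 71]] gives a statistic of about 2.67 and a p-value of about 0.10, matching the reported p-value of 0.1025.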
From this section, we can derive that the average age of patients who are active or recovered is different from the average age of patients who have died from COVID-19, and that older populations are more vulnerable to a negative outcome of the disease. In contrast, gender does not appear to have a strong association with the outcome. Both findings are consistent with the results of the logistic regression model from the previous section, as both age and gender are incorporated as part of the model.
6. CONCLUSION
This research presented the latest trends of COVID-19 across different regions as well as insights into COVID-19's symptoms and patient demographics, as visualized in the real-time COVID-19 tracker. In addition, the research dives into the details of the methodologies behind the real-time COVID-19 tracker, which include real-time data retrieval, data transformation and normalization, time-series forecasting with the ARIMA model, text mining techniques, and the logistic regression model. However, we need to be cautious about accepting the conclusions, as there are potential data quality issues: the patient-level data has a substantial amount of missing data and erroneous entries. To verify the findings in this research, we can try reproducing the derived insights when we have access to an updated dataset towards the end of the pandemic.

During a global-level pandemic such as COVID-19, it is paramount for the public to have access to the latest status of the outbreak and to be well-informed of relevant insights into the disease. A platform such as the real-time COVID-19 tracker assists the public community by disseminating accurate and reliable insights into the spread of COVID-19. The research and effort behind the tracker are motivated by the social responsibility to spread awareness among the common public by providing science-based data analysis, prediction, and relevant findings. This paper and research project are still ongoing, as many more investigations regarding COVID-19 can be carried out. It will serve as an initial step toward unraveling the many uncertainties that revolve around this global pandemic.
REFERENCES
[1] Nau, Robert. "Introduction to ARIMA Models." people.duke.edu/~rnau/Slides_on_ARIMA_models--Robert_Nau.pdf.
[2] "Estimation and Hypothesis Testing for Logistic Regression." courses.washington.edu/b515/l13.pdf.
[3] "Two-Sample T-Test." NCSS Statistical Software documentation, ncss-wpengine.netdna-ssl.com/wp-content/themes/ncss/pdf/Procedures/NCSS/Two-Sample_T-Test.pdf.
[4] McHugh, Mary. "The Chi-Square Test of Independence." 2013.
[5] Shumway, Robert H., and David S. Stoffer. Time Series Analysis and Its Applications: With R Examples. New York: Springer, 2006.