An Exploratory Study of COVID-19 Information on Twitter in the Greater Region

Abstract

The outbreak of COVID-19 has led to a burst of information in major online social networks (OSNs). In this constantly changing situation, OSNs have become an essential platform for people to express opinions and seek up-to-the-minute information. Thus, discussions on OSNs may become a reflection of reality. This paper aims to identify the distinctive characteristics of the Greater Region (GR) through a data-driven exploratory study of COVID-19 information on Twitter in the GR and related countries, using machine learning and representation learning methods. We find that tweet volume and COVID-19 cases in the GR and related countries are correlated, but only during a particular period of the pandemic. Moreover, we plot how topics change in each country and region from 2020-01-22 to 2020-06-05, and identify the main differences between the GR and related countries.


An Exploratory Study of COVID-19 Information on Twitter in the Greater Region

Ninghan Chen (a), Zhiqiang Zhong (a), Jun Pang (a,b,∗)

(a) Faculty of Sciences, Technology and Medicine, University of Luxembourg, L-4364 Esch-sur-Alzette, Luxembourg
(b) Interdisciplinary Centre for Security, Reliability and Trust, University of Luxembourg, L-4364 Esch-sur-Alzette, Luxembourg

Abstract

The outbreak of the Coronavirus disease (COVID-19) has led to an outbreak of pandemic information in major online social networks (OSNs). In this constantly changing situation, OSNs are becoming a critical conduit for people to express opinions and seek up-to-the-minute information. Thus, social behaviour on OSNs may become a predictor or reflection of reality. This paper aims to study the social behaviour of the public in the Greater Region (GR) and related countries based on Twitter information, using machine learning and representation learning methods. We find that tweet volume can be a predictor of outbreaks only in a particular period of the pandemic. Moreover, we map out the evolution of public behaviour in each country from 2020/01/22 to 2020/06/05, and identify the main differences in public behaviour between GR and related countries. Finally, we conclude that the volume of tweets about anti-contagion measures may affect the effectiveness of government policy.

Keywords: COVID-19, online social media, spatio-temporal analysis, topic modelling, pandemic information, human behaviour

∗ Corresponding author

Email addresses: [email protected] (Ninghan Chen), [email protected] (Zhiqiang Zhong), [email protected] (Jun Pang)

Preprint submitted to Online/Mobile Social Networking at the time of COVID-19, August 14, 2020

1. Introduction

On January 30th 2020, the World Health Organisation (WHO) declared a global health emergency over the COVID-19 outbreak. Later, on March 11th 2020, WHO characterised the COVID-19 outbreak as a pandemic. The outbreak of COVID-19 has led to an outbreak of pandemic information in major online social networks (OSNs), including Twitter, Facebook, Instagram, and YouTube [1]. In the middle of a massive COVID-19 outbreak and a constantly changing situation, OSNs are becoming a critical conduit for people to seek up-to-the-minute and local information. Moreover, due to physical isolation and social distancing, people spend much more time on OSNs: expressing opinions, encouraging others, openly lambasting mismanagement, voicing vitriol, and so on. On the one hand, social behaviour on OSNs may become a predictor or reflection of reality. On the other hand, the related information diffusion over OSNs can strongly influence people's behaviour, and thus have an impact on the effectiveness of control and protective measures deployed by governments.

There is a growing body of research that links OSN activities to COVID-19. Existing results have shown that OSN conversations can be a leading indicator of COVID-19 cases [2, 3], that discussions on OSNs can be categorised into multiple specific topics [4, 5, 6, 7], and that OSNs may help to design more efficient pandemic models for social behaviour and to implement more responsive government communication strategies [1, 8, 9]. However, there are three main problems with the existing research. First, studies using geographic data process the location information only roughly [2, 10].
Second, current topic modelling studies mostly focus on relatively long periods (weeks or months) [6, 1], which does not provide a precise representation of how topics change from day to day, and they mainly address general characteristics of user behaviour. Third, studies of shared information on OSNs at a global or country level [11, 2, 10] are coarse in terms of geographic division. When analysing COVID-19 information on Twitter by geographic location, it cannot be ignored that COVID-19 behaves in a way that is inescapably spatial [12]. Existing studies [13, 14] have shown that, whether for Spanish flu, Ebola or COVID-19, the geographic change of contagion reveals that economic, logistical, and flows-oriented relationality is intrinsically linked to the transmission pattern of infectious diseases [15]. Hence, research results focusing only on political sovereign states would be biased.

(Footnote: https://bit.ly/39EMOtY)

To fill this gap, we introduce relational urbanisation, an emerging idea from human geography which holds that city orientation is influenced by the network of materials, capital, information and culture, and reinforced by financialised capitalism [16], to determine the study area. The Greater Region (GR), a typical product of relational urbanisation with Luxembourg at its centre and including adjacent regions of Belgium, Germany and France (i.e., Wallonia, Saarland, Lorraine, Rhineland-Palatinate and the German-speaking Community of Belgium), is chosen as the representative in our case study, and we define the neighbouring countries mentioned above as the related countries of GR.

GR has the highest number of cross-border commuters in Europe, with approximately 250,000 commuters per day. This makes GR a particular and classic example: the virus spreads with high mobility, while the whole business model in GR requires a large number of cross-border workers to sustain it. With the implementation of a set of policies including border closures, and with the progression of the pandemic, the people living in the GR are affected in the economy, daily life, travel, and other aspects. This study is divided into two parts (Sections 4 and 5) to address the following three main questions:

RQ1

Can Twitter post volume be a predictor of COVID-19 daily cases during a long-term period in GR and its related countries?

RQ2

Are there certain patterns of social behaviour on OSNs in different periods of the pandemic, and how does GR, as a product of relational urbanisation, differ from other countries in terms of social behaviour on OSNs during the pandemic?

RQ3

Does more attention to the pandemic and prevention measures in the early stages of the pandemic affect the effectiveness of control and protective measures deployed by governments?

To answer these questions, we collected 51,[…],639 tweets from Twitter, posted by 15,[…],266 Twitter users globally from 2020-01-22 to 2020-06-05. Among them are 1,[…],308 posts by 41,690 users in GR and its related countries. To investigate RQ1, the basic reproductive rate R0 and the effective reproductive rate R(t) from epidemiology [17] are introduced to […] RQ2 and RQ3.

(Footnote: https://bit.ly/2P6NLSm)

The main contributions of this paper are fourfold.

(I) We generate a novel Twitter dataset from 2020-01-22 to 2020-06-05 which contains users with locations labelled in GR and related countries, including Luxembourg, France, Germany and Belgium, and the COVID-19 related tweets and conversations posted by these users. This dataset will be shared with the public to advance related research.

(II) A spatio-temporal analysis is carried out to determine how COVID-19 cases are correlated with Twitter posts over a long-term period. We find that tweet volume can be a predictor of outbreaks only during the early period of the pandemic. Before the R(t) value peaks, there is a spike of public concern about the pandemic, which is the best time to conduct pandemic precaution advocacy [21].

(III) We find that for countries with a long interval (average 27.[…] […] R(t) < […]

2. Related Work

Existing results have shown that social media conversations can be a leading predictor of new pandemic cases [2, 22, 3], and that in many countries tweets increase in volume before the number of confirmed cases increases. Studies have shown that anti-contagion policies can significantly and substantially reduce the spread of COVID-19 [23, 24, 25]. The effect of policies on the mitigation of spread varies, influenced by factors including culture, demographic information, socio-economic status and national health systems, and changes in public knowledge may affect the impact of the policies. If the public adjusts its behaviour in response to other new information not related to the policies, such as information from online sources, this may change the spread of COVID-19 [23].

Research on public behaviour patterns during the pandemic has been conducted based on data from smart devices [8], search indices [26, 27], and COVID-19 related conversations on Twitter. Bento et al. [9] report that searches for basic information about COVID-19 spiked when the first case was announced in each state of the United States, but that the first case report did not trigger discussions about policy and daily life. Topic modelling, an unsupervised approach that detects latent semantic structure [4], is widely used. Cinelli et al. [1] extract topics with word embedding on a global scale, concluding that social media may help to design more efficient epidemic models for social behaviour and to implement more efficient communication strategies. The LDA model is used by Medford et al. [5] and Ordun et al. [6] to analyse the topics in the early period of the pandemic. Sharma et al. [7] use character embedding [28] and Term Frequency Inverse Document Frequency (TF-IDF) word distribution with manual inspection for topic modelling.
However, LDA, a bag-of-words approach widely used to identify latent subject information in a large-scale document collection or corpus, has some drawbacks: it needs a large corpus to train, ignores contextual information and performs poorly on short texts [29]. As a result, these studies extract topics over fixed time periods, and the time granularity is too coarse to accurately reflect topic trends.

3. Data Description

In this section, we briefly describe how we collect COVID-19 tweets and COVID-19 daily case information for GR and the related countries.

Twitter, one of the most prominent online social media platforms, has been used extensively during the pandemic. In this study, 51,[…],639 tweets posted by more than 15 million Twitter users are retrieved. The data collection consists of the following steps. First, we collect posts containing COVID-19 related keywords from 2020/01/22 to 2020/06/05, together with the user-id and location of the users who posted them, based on Chen et al.'s work [30] and using the Twitter Streaming API. Second, as the user location information collected so far is user-defined, it is neither necessarily a true location nor machine-parseable, so the fuzzy location text is converted into real location information by leveraging geocoding APIs, Geopy and ArcGIS Geocoding. After obtaining the valid geographic location of each user, we select users located in the GR, Luxembourg, France, Germany, and Belgium to form the dataset. Table 1 gives an example of the final dataset, and Table 2 shows a summary of the collected tweet data for GR, Luxembourg, France, Germany, Belgium and the world.

(Footnotes: https://bit.ly/3gfW2PP, https://bit.ly/3f9OUDa)

Attribute   Description                       Example
Tweet id    A unique identifier for a Tweet   12319668395******
Full text   Text of a tweet                   RT @******:The Diamond princess is a UK ship managed by the US. UK should Be Responsible.

Table 1: A sample of our COVID-19 Twitter dataset

Table 2: Summary of our COVID-19 Twitter dataset

For the COVID-19 case data, the dataset published by the European Centre for Disease Prevention and Control allows us to obtain COVID-19 data including daily cases, deaths and locations for the countries we selected. As there is no official COVID-19 data published for GR, which is composed of Luxembourg, Wallonia in Belgium, Saarland and Rhineland-Palatinate in Germany and Lorraine in France, we compute daily cases and deaths in the GR by adding up the data for the regions mentioned above from the datasets published by the corresponding countries. It should be noted that in France the number of daily new cases is not available at the region level, and data on deaths, hospitalisations and hospital departures have been published only since March 18, 2020. The data for Lorraine is therefore counted as zero until March 18, 2020, and thereafter the sum of hospitalisations, hospital departures and deaths is taken as the total number of cases on each day.
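The aggregation rule described above can be sketched as follows. This is only an illustration under stated assumptions: the function names, the argument layout and the toy numbers are hypothetical and are not taken from our data pipeline.

```python
from datetime import date

# First day for which French regional data is available (from the text).
LORRAINE_START = date(2020, 3, 18)


def lorraine_proxy(day, hospitalised, discharged, deaths):
    """Proxy for Lorraine's daily case count.

    Zero before 18 March 2020; afterwards the sum of hospitalisations,
    hospital departures and deaths, as described in the text.
    """
    if day < LORRAINE_START:
        return 0
    return hospitalised + discharged + deaths


def gr_daily_cases(day, regional_new_cases, lorraine_counts):
    """Add up new cases of the non-French GR regions plus the Lorraine proxy.

    regional_new_cases: dict mapping region name -> new cases on `day`
    lorraine_counts:    (hospitalised, discharged, deaths) tuple for `day`
    """
    hospitalised, discharged, deaths = lorraine_counts
    return sum(regional_new_cases.values()) + lorraine_proxy(
        day, hospitalised, discharged, deaths
    )


# Toy example with made-up numbers (not real data).
toy_regions = {"Luxembourg": 10, "Wallonia": 20, "Saarland": 3, "Rhineland-Palatinate": 4}
toy_total = gr_daily_cases(date(2020, 3, 20), toy_regions, (5, 1, 1))
```

Keeping the Lorraine rule in its own function makes the proxy explicit, so it can be replaced if region-level case counts become available.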

4. Correlation between COVID-19 Cases and Tweets Volume

To investigate whether tweet volume can be a predictor of daily cases during the pandemic (RQ1), we introduce the basic reproductive rate R0 and the effective reproductive rate R(t) from epidemiology to slice the period, as the research covers a long duration, and a spatio-temporal analysis of correlations between tweet volume and daily cases in each period is conducted using Pearson correlation (PC).

(Footnotes: https://bit.ly/3jYhefx; https://bit.ly/2ErDii7, https://bit.ly/3gaGGMm, https://bit.ly/33c8CM8)

Figure 1: Daily tweets volume and COVID-19 new cases. Note: on 3rd June, France published a revision of data that led to a negative number of new cases; see https://bit.ly/33c8CM8 for the original news.

Figure 2: Effective reproductive rate (R(t)). Panels: GR, Luxembourg, Belgium, France, Germany; x-axis: Date (Mar 7 to Jun 5); y-axis: R(t).

R(t)-based time division

R0 is the expected number of cases arising directly from a single case in a population where all individuals are susceptible to infection [17], and R(t) represents the average number of new infections caused by an infected person at time t. If R(t) > 1, the number of cases will increase, e.g. at the beginning of an epidemic. When R(t) = 1, the disease is endemic, and when R(t) < 1, the number of cases will decrease. For the calculation of real-time R(t), we use a Bayesian approach [31] with Gaussian noise to calculate the time-varying R(t) based on daily new cases, which is also the official method for calculating R(t) in Luxembourg. While the study of the R0 of COVID-19 is still ongoing, in this research we use the R0 estimated by WHO, which ranges between 1.[…]

(Footnotes: https://github.com/k-sys/covid-19/, https://bit.ly/3fgOQkY)

Figure 3: Total days for each pandemic period. Bars: Pre-peak, Free-contagious, Measures and Decay periods for GR, Luxembourg, Belgium, France and Germany; y-axis: Days.

[…] R(t) for GR, Luxembourg, Belgium, France, and Germany are shown in Figure 2. Here, we divide the pandemic into four periods, which are:

Pre-peak period (30 days before R(t) < […])
Free-contagious period (1.[…] < R(t) < […])
Measures period (1 < R(t) < […])
Decay period (R(t) < […])

[…] Free-contagious period, R(t) is based on R0, taking into account the results of the prevention measures [32]; during this period the measures taken to prevent the spread of the pandemic have not yet shown results or have failed to stop the spread, and COVID-19 still spreads with an R0 reproductive rate. The precise durations of these pandemic periods for each country and region are summarised in Table 3.

             Pre-peak           Free-contagious    Measures period    Decay period
GR           2/14 - 3/15/2020   3/15 - 3/21/2020   3/21 - 4/17/2020   4/17 - 6/05/2020
Luxembourg   2/19 - 3/20/2020   3/20 - 3/24/2020   3/24 - 4/01/2020   4/01 - 6/05/2020
Belgium      2/04 - 3/05/2020   3/05 - 3/25/2020   3/25 - 4/18/2020   4/18 - 6/05/2020
France       2/05 - 3/06/2020   3/06 - 3/30/2020   3/30 - 4/23/2020   4/23 - 6/05/2020
Germany      1/29 - 2/28/2020   2/28 - 3/24/2020   3/24 - 4/02/2020   4/02 - 6/05/2020

Table 3: Time duration of the four pandemic periods for GR, Luxembourg, Belgium, France and Germany
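The number of days in each period follows directly from the boundary dates in Table 3. The short sketch below recomputes them with the Python standard library; the dictionary layout and the function name are illustrative assumptions, while the dates are copied from Table 3.

```python
from datetime import date

# Boundary dates (month, day) in 2020 copied from Table 3. The four
# consecutive periods of each region are delimited by five boundaries.
BOUNDARIES = {
    "GR":         [(2, 14), (3, 15), (3, 21), (4, 17), (6, 5)],
    "Luxembourg": [(2, 19), (3, 20), (3, 24), (4, 1),  (6, 5)],
    "Belgium":    [(2, 4),  (3, 5),  (3, 25), (4, 18), (6, 5)],
    "France":     [(2, 5),  (3, 6),  (3, 30), (4, 23), (6, 5)],
    "Germany":    [(1, 29), (2, 28), (3, 24), (4, 2),  (6, 5)],
}
PERIODS = ["Pre-peak", "Free-contagious", "Measures", "Decay"]


def period_lengths(region):
    """Return {period name: length in days} for one region."""
    days = [date(2020, month, day) for month, day in BOUNDARIES[region]]
    return {
        name: (end - start).days
        for name, start, end in zip(PERIODS, days, days[1:])
    }


for region in BOUNDARIES:
    print(region, period_lengths(region))
```

Because 2020 is a leap year, using `datetime.date` rather than counting days by hand avoids an off-by-one error for periods that span the end of February.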

The exact numbers of days of each pandemic period are shown in Figure 3 for the countries and GR. The Free-contagious period in Luxembourg and GR is particularly short (4 and 6 days) compared to the other countries (20 to 25 days). Existing research indicates that the differences in periods between regions and countries are influenced by a variety of factors including government policy [33, 34], population density [35], mobility [36, 37] and so on. In this paper, we analyse them from the perspective of social behaviour on OSNs in Section 5 (RQ3).

RQ1

To answer RQ1, we hypothesise that daily tweet volume can be predictive of daily cases, and we calculate the relationship between them using PC, where a PC with a larger absolute value means a stronger relation. The results are shown in Table 4. A lag refers to the tweets occurring after the cases; Lag = -5 days means that we match the daily cases with the tweet volume from five days earlier, in other words, a 5-day lead.
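The lag convention above can be made concrete with a small sketch. It is not the code used for Table 4: the function names and the toy series are assumptions for illustration, Pearson's coefficient is computed by hand to keep the example free of dependencies, and p-values are not computed.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


def lagged_pearson(tweets, cases, lag):
    """Correlate daily cases with tweet volume shifted by `lag` days.

    lag = -k pairs the cases on day t with the tweets on day t - k
    (tweets lead by k days); lag = +k pairs them with the tweets on day t + k.
    """
    if lag < 0:
        k = -lag
        return pearson(tweets[:-k], cases[k:])
    if lag > 0:
        return pearson(tweets[lag:], cases[:-lag])
    return pearson(tweets, cases)


# Toy series (made up): cases follow tweet volume with a 5-day delay.
tweets = [3, 8, 2, 9, 4, 7, 1, 6, 5, 10, 12, 11]
cases = [0] * 5 + [2 * v for v in tweets[:-5]]

r_lead5 = lagged_pearson(tweets, cases, -5)  # tweets lead cases by 5 days
r_same_day = lagged_pearson(tweets, cases, 0)
```

In the toy series the cases are an exact rescaling of the tweet volume five days earlier, so the coefficient at Lag = -5 is 1 up to floating-point error, whereas other lags give weaker values.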

Pre-peak period. As shown in Table 4, there is a clear trend of strong correlation (PC > […], p < 0.05) with lags during the Pre-peak period, reaching its maximum at -5 or -6 days, indicating that tweet volume may be a predictor of outbreaks in the early period of the pandemic. Although current research on the incubation period of COVID-19 is inconclusive, several studies have suggested that it is on average 5-6 days [38, 39, 40]. In this regard, we propose that the 5-6 day lead may be related to the lag between infection and the onset of symptoms that allow a case to be detected and confirmed.

Free-contagious period. There is no clear trend of correlation with lags except for Luxembourg, indicating that tweet volume cannot be used to predict daily cases in the Free-contagious period. The period lasted for only 4 days in Luxembourg, which is too short for PC to reflect the correlation reliably. However, the PC values show a strongly negative correlation between tweet volume and daily cases, which reflects a short period of declining discussion of the pandemic after it reached its peak, even though the number of cases continued to rise rapidly. This result confirms, on our dataset, the conclusion of Smith et al. [41], who noted that public awareness of a disease declines sharply after the peak, even though infection rates remain high. In other words, the public's interest in the pandemic declines after a period of heightened attention during the Pre-peak period.

Measures period. There is a clear trend of correlation with lags: tweet volume begins to level off, with a moderate correlation at a 0 or 1-day lag (0.[…] > PC > […], p < 0.05) to the daily cases. Tweet volume cannot act as a predictor for daily cases here; it fluctuates with the number of cases on the current or previous day. It is worth noting that Pearson's coefficient is sensitive to outliers and is not robust. With too few dates included, a single outlier can change the direction of the coefficient. This period lasted for only 8 days in Luxembourg, resulting in an anomalous value (PC = −[…]) […]

             GR               Luxembourg       Belgium          France           Germany
Lag (days)   PC      p value  PC      p value  PC      p value  PC      p value  PC      p value

Pre-peak period
-10           0.613  0.001     0.127  0.513    -0.143  0.459     0.275  0.148     0.288  0.130
-9            0.813  0.001     0.413  0.026     0.108  0.577     0.420  0.023     0.695  0.001
-8            0.865  0.001     0.581  0.001     0.292  0.124     0.561  0.002     0.776  0.001
-7            0.874  0.001     0.669  0.001     0.473  0.010     0.701  0.001     0.817  0.001
-6            […]

Free-contagious period
-9           -0.854 0.001 -0.519 0.048 -0.875 -0.798 -0.909 -0.934 -0.911 […]

Measures period
-3            0.279  0.262     0.578  0.103     0.343  0.093     0.646  0.001     0.333  0.347
-2            0.031  0.902     0.093  0.813    -0.121  0.564     0.509  0.011    -0.334  0.345
-1            0.093  0.713    -0.004  0.992     0.134  0.524     0.562  0.004     0.175  0.629
0             0.590 0.010 -0.572 0.108 -0.903 […]

Decay period
0             0.427 0.006 0.329 0.013 0.529 0.001 0.775 0.001 […]

Table 4: PC (Pearson's correlation) between tweets volume and COVID-19 daily cases with different lags

Decay period. The correlations between tweet volume and daily cases take two forms here: either the correlation is weak, or, although there is a correlation, the trend of correlation with lags is not significant. Both demonstrate that it is not possible to predict daily cases from tweet volume during this period.

In summary, with the spatio-temporal analysis of the correlation between tweet volume and COVID-19 new cases during the four periods of the pandemic, we find that tweet volume can be a predictor of outbreaks only during the Pre-peak period of the pandemic. Regardless of the time at which R(t) peaks, there is a 5-6 day lead between tweet volume and COVID-19 daily cases, which may be related to the lag between infection and the onset of symptoms. Moreover, before a pandemic strikes there is a high level of public concern about it, and the Pre-peak period is the best time to conduct pandemic precaution advocacy [21], as public concern about the pandemic will decrease after entering the Free-contagious period. Regarding the particularity of GR, we find that the Free-contagious periods in GR and Luxembourg are exceedingly short, and that Luxembourg's Measures period is similarly short. The reasons for this will be explored further in Section 5 from the perspective of social behaviour on OSNs (RQ2).

5. Topic Modelling and Classification of Tweets

In order to understand the actual situation and gain further insights into the behaviour of the public in GR and related countries on social media, the tweets posted by users in GR and related countries are analysed with BERT [18] and LDA [19]. We extract the main daily topics of the tweets and categorise the generated topics, to determine whether there are certain patterns of social behaviour on OSNs in the different periods of the pandemic (RQ2). More importantly, we investigate how GR, as a product of relational urbanisation, differs from other countries in terms of public concern on OSNs during the pandemic (RQ2), and whether more attention to the pandemic and prevention measures in the early stages of the pandemic shortens the Free-contagious period (RQ3). The overall workflow of our topic modelling and classification tasks is shown in Figure 4.

Prior to topic modelling, the tweet data needs to be preprocessed. In particular, missing delimiters are detected according to uppercase letters, all text is lower-cased, and URLs, mentioned usernames and 'RT' are removed. In addition, punctuation and numbers are filtered out, typos are corrected with SymSpell and stop words are removed. Then, since the tweets were collected by keyword search, we remove keywords such as 'coronavirus', 'koronavirus', 'corona', 'covid-19' and 'covid' from the text to avoid bias. Finally, the Natural Language Toolkit (NLTK) is used for part-of-speech tagging, stemming and tokenization. Nouns, verbs, adjectives and adverbs are selected.

Figure 4: Workflow of topic modelling and classification. Stages: tweet text, preprocessing (tokens, sentences), TF-IDF, LDA (probabilistic topic assignment vector), BERT (embedding vector), concatenation, autoencoder, latent space representation, k-means clustering, UMAP visualisation, topics, manual classification, SVM, topic classification.

Topic modelling.

Aiming to identify the latent topics of the tweets posted by the public in GR and related countries, we adopt the general structure of the contextual topic embedding method (CTE) in this paper, to extract daily topic data and obtain a more accurate picture of topic trends. CTE mainly consists of two components, LDA and BERT, which extract different information from sentences for embedding. LDA, a bag-of-words approach widely used to identify latent subject information in a large-scale document collection or corpus, has some drawbacks: it needs a large corpus to train, […] γ.

(Footnotes: https://github.com/wolfgarbe/symspell, https://bit.ly/3hUQjzf)

Besides, as our data are multilingual, words in languages other than the predominant English appear less frequently and are easily overlooked in topic modelling; hence we adopt the TF-IDF model to determine word relevance in the documents [42], and feed the corpus generated by TF-IDF to LDA instead of a simple bag-of-words corpus. After obtaining the concatenated vector in high-dimensional space, CTE uses an autoencoder to learn a low-dimensional latent space representation of the concatenated vector with more condensed information. Then k-means [43] is applied for clustering, with the number of clusters k, that is, the number of topics, kept as a hyper-parameter. We extract the word frequencies in each cluster, sort them, and take the top ten words as the representative topic of that cluster. For visualisation, Uniform Manifold Approximation and Projection (UMAP) [44], a state-of-the-art visualisation and dimension reduction algorithm, is used to reduce the dimensionality of the latent space.

Country      Coherence score   Silhouette score
GR           0.432             0.893
Luxembourg   0.474             0.894
France       0.351             0.590
Belgium      0.377             0.864
Germany      0.336             0.655

Table 5: Average coherence score and average silhouette score of CTE
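The rule-based part of the preprocessing and the labelling of a cluster by its ten most frequent words can be sketched with the standard library alone. This is a simplified illustration, not our pipeline: spelling correction (SymSpell), part-of-speech filtering and stemming (NLTK) are omitted, the stop-word list is a tiny made-up placeholder, and the function names are hypothetical.

```python
import re
from collections import Counter

# Search keywords removed to avoid bias (list taken from the text).
KEYWORDS = {"coronavirus", "koronavirus", "corona", "covid-19", "covid"}
# Placeholder stop-word list (an assumption; a real list would be per language).
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "to", "in", "and", "by", "be"}


def clean_tweet(text):
    """Return the list of cleaned tokens of one tweet."""
    text = re.sub(r"https?://\S+", " ", text)            # remove URLs
    text = re.sub(r"@\w+", " ", text)                    # remove mentioned usernames
    text = re.sub(r"\bRT\b", " ", text)                  # remove the retweet marker
    # Restore a missing delimiter when an uppercase letter directly follows
    # a lowercase letter or sentence punctuation (e.g. "ship.UK").
    text = re.sub(r"(?<=[a-z.!?])(?=[A-Z])", " ", text)
    text = text.lower()
    for keyword in sorted(KEYWORDS, key=len, reverse=True):
        text = text.replace(keyword, " ")                # longest keywords first
    text = re.sub(r"[^\w\s]|\d|_", " ", text)            # remove punctuation and numbers
    return [token for token in text.split() if token not in STOP_WORDS]


def top_words(cluster_tweets, k=10):
    """Most frequent k tokens over all tweets assigned to one cluster."""
    counts = Counter(
        token for tweet in cluster_tweets for token in clean_tweet(tweet)
    )
    return [word for word, _ in counts.most_common(k)]


sample = "RT @user: The Diamond princess is a UK ship.UK should Be Responsible https://t.co/x #COVID-19"
tokens = clean_tweet(sample)
```

The pattern `[^\w\s]` is Unicode-aware in Python 3, so accented French and German letters are kept, which matters for the multilingual tweets of the GR.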

Average coherence score and average silhouette score are used as the metrics of CTE. We compute the average coherence score by calculating the topic coherence for each topic individually and averaging the results. The average silhouette score is the mean of the silhouette scores for each day. Topic modelling is conducted on the daily tweets of GR and the related countries. We fine-tune the topic models and arrive at the optimal n = 7 and γ = 0.[…]

Figure 5: A sample of UMAP clustering results

After obtaining 4,763 topics from topic modelling, we randomly selected 2,435 topics and manually classified them into the following 7 categories:

1. 'Wuhan & China': Topics about Wuhan and China-related issues.
2. 'Measures': Topics about basic information including symptoms, anti-contagion and treatment measures of COVID-19.
3. 'Local news': Topics about local COVID-19 news and daily new cases, deaths, etc.
4. 'International news': Topics about international COVID-19 news.
5. 'Policy and daily life': Topics about COVID-19 related policies, encompassing lockdown, closure of borders and limits on public gatherings, and the impact of the policies on daily life.
6. 'Racism': Topics about racism.
7. 'Other': Other topics.

These manually classified topics are used to train a Support Vector Machine (SVM) [20] for supervised classification. In particular, the words of each topic are converted to word frequency vectors with TfidfVectorizer and the country is encoded with LabelEncoder. The feature vector is made up of these two elements. Since our manually labelled dataset is class-imbalanced, the Synthetic Minority Oversampling Technique [48] is used to oversample the minority classes and mitigate the imbalance. The dataset is split into 80% for training and 20% for testing.
Grid search with 10-fold cross-validation is applied to the training dataset to find the optimal hyperparameters, and the final SVM model is fitted on the entire training set. Table 6 shows the precision, recall, F1 score, support and macro-average F-score of the trained classifier for each topic category. Then, the obtained SVM model is used to classify the rest of the topics. Table 7 shows the number of topics of each category for each country. The categories with the highest percentages are 'Wuhan & China' and 'Policy and daily life'. In general, the number of topics about policy and daily life is much higher in Luxembourg (56.6%) than in other countries (ave = 33.[…]

(Footnotes: https://bit.ly/30bA8Ye, https://bit.ly/39EO5kK)

Table 6: Metrics of the classification results

Category   GR    Luxembourg   Belgium   France   Germany   Total
1          245   168          287       202      315       1,217
2          64    34           48        65       41        252
3          99    44           109       285      110       647
4          134   77           114       52       167       544
5          353   525          370       250      295       1,793
6          23    7            23        31       15        99
7          41    72           15        60       23        211
Total      959   927          966       945      966       4,763

Table 7: Topic volume for each category/country (region)

[…] Pre-peak period, the shaded regions in different colours indicate, in order, the Pre-peak period, Free-contagious period, Measures period, and Decay period. The black dotted line marks the date on which the first case appeared. With the exception of the GR, there is an interval between the date of the first case and the date from which cases occur every day. The solid black line indicates the date from which new cases appear every day. For ease of discussion, we call this day the 'outbreak day' (OD).
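The category shares discussed with Table 7 can be recomputed from the table itself. In the sketch below the counts are copied from Table 7, while the dictionary layout and the function name are illustrative assumptions.

```python
# Topic counts per category (1-7) copied from Table 7.
TABLE7 = {
    "GR":         [245, 64, 99, 134, 353, 23, 41],
    "Luxembourg": [168, 34, 44, 77, 525, 7, 72],
    "Belgium":    [287, 48, 109, 114, 370, 23, 15],
    "France":     [202, 65, 285, 52, 250, 31, 60],
    "Germany":    [315, 41, 110, 167, 295, 15, 23],
}


def share(region, category):
    """Percentage of a region's topics that fall into `category` (1-based)."""
    counts = TABLE7[region]
    return 100.0 * counts[category - 1] / sum(counts)


POLICY = 5  # 'Policy and daily life'
lux_policy = share("Luxembourg", POLICY)
others = [share(region, POLICY) for region in TABLE7 if region != "Luxembourg"]
mean_others = sum(others) / len(others)  # unweighted mean over GR, BE, FR, DE
```

With these counts, the Luxembourg share rounds to 56.6%, and the unweighted mean of the other four columns is roughly 33%, in line with the figures quoted in the text.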

RQ2

In this section, we aim to answer RQ2: whether there are certain patterns of social behaviour on OSNs in different periods of the pandemic, and how GR, as a product of relational urbanisation, differs from other countries in terms of social behaviour on OSNs during the pandemic.

Figure 6: Topic categories in GR and related countries

The behavioural pattern obtained from Figure 6 is slightly different in our case from the conclusion of Bento et al. [9]. In France, Germany and Belgium, the appearance of the first case triggered only a small amount of discussion about protective measures, and such discussion did not start to increase until OD. In other words, the public did not really heed the pandemic until OD, when the virus was already spreading. This may be explained by the existence of a long interval (average = 27.3 days) between the date of the first case and OD in France, Germany, and Belgium. During this interval, sporadic cases may not have attracted enough public attention, and the public's attention was still focused on China-related news. Moreover, the report of the first case did not stimulate discussions about policies and daily life either; such discussion did not emerge frequently until OD.

The early picture in Luxembourg and GR is different. Figure 6 shows that the public in Luxembourg and GR started to discuss measures 1-2 days before the first case appeared. This may be explained by the late occurrence of the first case in Luxembourg and GR: by then the other three countries had already passed OD, and the outbreak in those countries may have attracted public attention in GR and Luxembourg. Interestingly, in Luxembourg, the discussion about policies and daily life was already ongoing before the first case was announced and increased immediately afterwards. A word cloud of the topics in Luxembourg from 22 January to 1 March (the date of the first case) is depicted in Figure 7; it shows that the topics are mainly travel-related.
This may be explained by the fact that the proportion of foreign residents in Luxembourg is 47.4%, and residents are more concerned about travel-related policies in Luxembourg and other countries.

As a relational urbanisation product, however, GR exhibits a high level of interest in policy and daily life: tweets in this category account for 47.1% of the total tweets volume during the Free-contagious and Measures periods, while in Luxembourg, the central region of GR, this percentage is 66.1%. Figure 8(a) shows boxplots of the distribution of the CR on policy and daily life during the Free-contagious and Measures periods. It shows that the public in GR, a region that relies on foreign labour and has high mobility, is more responsive to policies than the public in Belgium, France and Germany. Figure 8(b) shows that during the Free-contagious and Measures periods, public interest in local news was similar in all regions except France. Moreover, during the Decay period, while there is a downward trend (p < .05) in the total daily tweets volume, there is an upward trend (p < .05) in the CR of policy and daily life, except in Luxembourg, where the rate is consistently high.
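The text does not state which test produced the trend p-values for the Decay period. As a minimal sketch of how such a comparison could be run, the following assumes that the CR is the daily share of tweets in one category and uses the Mann-Kendall test as one common choice for monotonic trends. The function names, the category label and the toy counts are hypothetical and are not taken from our dataset.

```python
import math
from collections import Counter


def category_ratio(daily_counts, category):
    """Share of each day's tweets that fall into `category` (assumed CR definition)."""
    ratios = []
    for counts in daily_counts:
        total = sum(counts.values())
        ratios.append(counts.get(category, 0) / total if total else 0.0)
    return ratios


def mann_kendall(series):
    """Two-sided Mann-Kendall test for a monotonic trend.

    Returns (S, z, p). S > 0 points to an upward trend, S < 0 to a downward one.
    The normal approximation is usually considered adequate for about 10+ points.
    """
    n = len(series)
    # S is the number of increasing pairs minus decreasing pairs over all i < j.
    s = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            diff = series[j] - series[i]
            s += (diff > 0) - (diff < 0)
    # Variance of S, with the standard correction for groups of tied values.
    ties = Counter(series).values()
    var_s = (n * (n - 1) * (2 * n + 5)
             - sum(t * (t - 1) * (2 * t + 5) for t in ties)) / 18
    if var_s == 0:
        return s, 0.0, 1.0
    # Continuity-corrected z statistic.
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value under N(0, 1)
    return s, z, p


# Toy data (illustrative only): total volume falls while the category share rises.
toy_days = [{"policy_and_daily_life": 10 + d, "other": 90 - 6 * d} for d in range(10)]
volume = [sum(day.values()) for day in toy_days]
cr = category_ratio(toy_days, "policy_and_daily_life")

s_volume, _, p_volume = mann_kendall(volume)
s_cr, _, p_cr = mann_kendall(cr)
```

On real data, `toy_days` would be replaced by the per-day topic counts of one country or region within the Decay period, and the sign of S together with p would indicate the direction and significance of each trend.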

RQ3

As can be seen from Figure 6 and Figure 3, the discussion volume about measures during the Pre-peak period may affect the length of the Free-contagious period (p < …). During the Pre-peak period, the CR of measures was much higher in GR (3.41%) and Luxembourg (7.62%) than in France (1.90%), Belgium (1.84%) and Germany (0.0%). It should be noted that the discussion of measures is not actually non-existent in Germany, but the tweets volume may be too small to be recognised as separate topics during the topic modelling process. The results of topical classification suggest that the public in Germany, France and Belgium may have underestimated the severity of the pandemic during the Pre-peak period. This underestimation can be interpreted as optimism bias, which makes people believe their exposure to disease is low [49]. During a pandemic, people often exhibit an optimism bias, a cognitive bias that causes someone to believe that they are less likely to be involved in negative events [50]. Our results show that even though the public in Belgium, France and Germany had shown sustained and long-term concern on OSNs about COVID-19 occurring in China, optimism bias emerged when COVID-19 appeared locally, causing the public to ignore the emergence of the cases and to show low support for anti-contagion measures and government policies [51]. Thus, tweets about anti-contagion or treatment measures may serve as an indicator of whether the public is experiencing an optimism bias and, to some extent, may shorten the Free-contagious period; public knowledge may also affect the impact of the policies.

https://bit.ly/3fdhgwj

Figure 7: Word cloud of Luxembourg Tweets from 2020-01-22 to 2020-03-01

Figure 8: Distribution of proportion of tweets on policy and daily life. (a) Policy and daily life; (b) Local news

Country      Lockdown    Maximum gatherings   School closures
Luxembourg   3/15/2020   No gatherings        3/13/2020
Belgium      3/18/2020   No gatherings        3/13/2020
France       3/17/2020   No gatherings        3/13/2020
Germany      3/22/2020   2 individuals        3/13/2020

Table 8: Anti-contagion policies in different countries

6. Conclusion and Discussion

In this paper, we have studied COVID-19 related content on Twitter, introduced the idea of relational urbanisation, and chosen the Greater Region and its related countries as the research area. Our analyses focused on three research questions: can the Twitter posts volume be a predictor of COVID-19 daily cases over a long-term period? (RQ1), are there certain patterns of social behaviour on OSNs at different periods of the pandemic, and how does GR differ from other countries? (RQ2), and will more attention to the pandemic and anti-contagion measures in the early stages of the pandemic affect the effectiveness of the control and protective measures deployed by governments? (RQ3).

With the spatio-temporal analysis of the correlation between tweets volume and COVID-19 new cases during the four periods of the pandemic, our general answer to RQ1 is that tweets volume can be a predictor of outbreaks only during the Pre-peak period of the pandemic, and that the lead of tweets volume over COVID-19 daily cases may be related to the lag between infection and the onset of symptoms. For RQ2, we have shown a certain pattern in the main categories of public discussion on social media during the pandemic. In the early stage, the public focused on news related to China. If the first case occurs in a country and there is no rapid outbreak, public concern is not diverted to anti-contagion and treatment measures, policies and local news until a complete outbreak, that is, until cases begin to occur daily in that region. GR, as a region with a large number of cross-border workers, has shown high interest in policies since the first case, even though no lockdown policy had been implemented at that time. At the same time, Luxembourg, where foreign residents make up 47.4% of the population, has shown great concern for policies, including travel policies, from the beginning of the pandemic, which is not found in other regions. For RQ3, our answer is that the discussion volume about measures during the Pre-peak period may affect and shorten the length of the Free-contagious period.

Our results can be used to understand social behaviour on OSNs during the pandemic, and the differences in social behaviour in relational urbanisation regions when facing the pandemic. We identify the best time to conduct pandemic precaution advocacy, which helps to generate policy support.

There are still some limitations of our study. First, in our dataset, we did not take into account bots that post misleading information, which can lead to a possible bias in topical modelling and classification. For our initial exploration of topic categories, we chose SVM to build a baseline method for topic classification. We will utilise other state-of-the-art text classification methods to refine the classification in further study. Second, our case study has some statistical limitations, and data from more countries will be included in future studies to ensure the statistical significance of the conclusions. Third, more research can be performed based on our dataset. For example, in future, we will conduct sentiment analysis on the tweets of different categories at each pandemic period to find out whether there is a certain pattern in the public's sentiment about the pandemic and how it differs between GR and other countries. For RQ3, multi-class sentiment analysis with BERT will be conducted to figure out whether and to what extent people are optimistic or pessimistic about being affected by a pandemic during the Pre-peak period. Finally, during the writing of this article, the second wave of COVID-19 began to appear in Luxembourg and some other countries. In a future study, we will conduct a comparative study focusing on the regions where the second wave occurred. Sentiment analysis and text classification with state-of-the-art methods will be deployed to investigate whether OSN information reflects public attitudes and behaviour. We will attempt to identify social behaviour that may lead to the second wave, such as laxity or resistance to policies and anti-infection measures. Such timely indicators are potentially useful for appropriate policy changes to avoid a new pandemic outbreak.

Acknowledgements. This work was partially supported by Luxembourg's Fonds National de la Recherche, via grant COVID-19/2020-1/14700602 (PandemicGR).

References

[1] M. Cinelli, W. Quattrociocchi, A. Galeazzi, C. M. Valensise, E. Brugnoli, A. L. Schmidt, P. Zola, F. Zollo, A. Scala, The covid-19 social media infodemic, arXiv preprint arXiv:2003.05004 (2020).
[2] L. Singh, S. Bansal, L. Bode, C. Budak, G. Chi, K. Kawintiranon, C. Padden, R. Vanarsdall, E. Vraga, Y. Wang, A first look at COVID-19 information and misinformation sharing on Twitter, arXiv preprint arXiv:2003.13907 (2020).
[3] K. Jahanbin, V. Rahmanian, Using Twitter and web news mining to predict COVID-19 outbreak, Asian Pacific Journal of Tropical Medicine 13 (2020) 26–28.
[4] C. Wang, D. M. Blei, Collaborative topic modelling for recommending scientific articles, in: Proceedings of the 2011 International Conference on Knowledge Discovery and Data Mining (SIGKDD), ACM, 2011, pp. 448–456.
[5] C. Ordun, S. Purushotham, E. Raff, Exploratory analysis of covid-19 tweets using topic modelling, UMAP, and digraphs, arXiv preprint arXiv:2005.03082 (2020).
[6] R. J. Medford, S. N. Saleh, A. Sumarsono, T. M. Perl, C. U. Lehmann, An "Infodemic": Leveraging high-volume Twitter data to understand public sentiment for the COVID-19 outbreak, medRxiv (2020).
[7] K. Sharma, S. Seo, C. Meng, S. Rambhatla, Y. Liu, COVID-19 on Social Media: Analyzing Misinformation in Twitter Conversations, arXiv preprint arXiv:2003.12309 (2020).
[8] S. Gupta, T. D. Nguyen, F. L. Rojas, S. Raman, B. Lee, A. Bento, K. I. Simon, C. Wing, Tracking public and private response to the COVID-19 epidemic: Evidence from state and local government actions, Tech. rep., National Bureau of Economic Research (2020).
[9] A. I. Bento, T. Nguyen, C. Wing, F. Lozano-Rojas, Y. Y. Ahn, K. Simon, Evidence from Internet search data shows information-seeking responses to news of local COVID-19 cases, Proceedings of the National Academy of Sciences of the United States of America (PNAS) 117 (21) (2020).
[10] C. E. Lopez, M. Vasu, C. Gallemore, Understanding the perception of COVID-19 policies by mining a multilanguage Twitter dataset, arXiv preprint arXiv:2003.10359 (2020).
[11] M. Thelwall, S. Thelwall, Retweeting for COVID-19: Consensus building, information sharing, dissent, and lockdown life, arXiv preprint arXiv:2004.02793 (2020).
[12] P. Adler, R. Florida, M. Hartt, Mega regions and pandemics, Tijdschrift voor Economische en Sociale Geografie 111 (3) (2020) 465–481.
[13] S. H. Ali, R. Keil, Networked disease: Emerging infections in the global city, Wiley-Blackwell, 2009.
[14] C. Connolly, R. Keil, S. H. Ali, Extended urbanisation and the spatialities of infectious disease: Demographic change, infrastructure and governance, Urban Studies (2020).
[15] M. Hesse, M. Rafferty, Relational cities disrupted: reflections on the particular geographies of COVID-19 for small but global urbanisation in Dublin, Ireland, and Luxembourg City, Luxembourg, Tijdschrift voor Economische en Sociale Geografie 111 (3) (2020) 451–464.
[16] T. Sigler, K. Martinus, I. Iacopini, B. Derudder, The role of tax havens and offshore financial centres in shaping corporate geographies: An industry sector perspective, Regional Studies (2019).
[17] J. A. P. Heesterbeek, K. Dietz, The concept of R0 in epidemic theory, Statistica Neerlandica 50 (1) (1996) 89–110.
[18] J. Devlin, M. W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (HLT), Vol. 1, ACL, 2019, pp. 4171–4186.
[19] D. M. Blei, A. Y. Ng, M. I. Jordan, Latent dirichlet allocation, Journal of Machine Learning Research 3 (2003) 993–1022.
[20] C.-C. Chang, C.-J. Lin, LIBSVM: A library for support vector machines, Transactions on Intelligent Systems and Technology 2 (3) (2011) 1–27.
[21] P. M. Sandman, Crisis communication best practices: Some quibbles and additions, Journal of Applied Communication Research 34 (3) (2006) 257–262.
[22] C. St Louis, G. Zorlu, Can Twitter predict disease outbreaks?, British Medical Journal 344 (7861) (2012).
[23] S. Hsiang, D. Allen, S. Annan-Phan, K. Bell, I. Bolliger, T. Chong, H. Druckenmiller, L. Y. Huang, A. Hultgren, E. Krasovich, et al., The effect of large-scale anti-contagion policies on the COVID-19 pandemic, Nature (2020) 1–9.
[24] C. Courtemanche, J. Garuccio, A. Le, J. Pinkston, A. Yelowitz, Strong social distancing measures in the United States reduced the COVID-19 growth rate, Health Affairs (2020) 10–1377.
[25] T. Dergiades, C. Milas, T. Panagiotidis, Effectiveness of government policies in response to the COVID-19 outbreak, SSRN (2020).
[26] D. Hu, X. Lou, Z. Xu, N. Meng, Q. Xie, M. Zhang, Y. Zou, J. Liu, G. P. Sun, F. Wang, More effective strategies are required to strengthen public awareness of COVID-19: Evidence from Google trends, Journal of Global Health 10 (1) (2020).
[27] M. Effenberger, A. Kronbichler, J. I. Shin, G. Mayer, H. Tilg, P. Perco, Association of the COVID-19 pandemic with Internet search volumes: A Google trends