Studying Leaders & Their Concerns Using Online Social Media During The Times Of Crisis -- A COVID Case Study

Abstract

Online social media (OSM) has emerged as a prominent platform for debate on a wide range of issues. Even celebrities and public figures often share their opinions on a variety of topics through OSM platforms. One such subject that has gained a lot of coverage on Twitter is the Novel Coronavirus, officially known as COVID-19, which has become a pandemic and has sparked a crisis in human history. In this study, we examine 29 million tweets over three months to study highly influential users, whom we refer to as leaders. We recognize these leaders through social network techniques and analyze their tweets using text analysis. Using a community detection algorithm, we categorize these leaders into four clusters: research, news, health, and politics, with each cluster containing Twitter handles (accounts) of individual users or organizations. E.g., the health cluster includes the World Health Organization (@WHO), the Director-General of WHO (@DrTedros), and so on. The emotion analysis reveals that (i) all clusters show an equal amount of fear in their tweets, (ii) research and news clusters display more sadness than others, and (iii) health and politics clusters are attempting to win public trust. According to the text analysis, the (i) research cluster is more concerned with recognizing symptoms and the development of vaccination; (ii) news and politics clusters are mostly concerned with travel. We then show that we can use our findings to classify tweets into clusters with a score of 96% AUC ROC.


Studying Leaders During Times of Crisis Using Online Social Media - A COVID Case Study

Rahul Goel a,1, Rajesh Sharma a

a Institute of Computer Science, University of Tartu, Estonia

Abstract

Online social media (OSM) has become a primary platform for discussion on diverse topics. Even famous and public figures often express their views on various topics through OSM platforms. Novel Coronavirus, officially called COVID-19, which has become a pandemic and has created a crisis in human history, is one such topic that has attracted a lot of attention on Twitter in recent times. In this work, we analyze 29 million tweets spanning three months to study highly influential users, whom we call leaders. We identify these leaders using social network analysis and analyze their tweets using text analysis techniques. We group these leaders into four clusters, namely research, news, health and politics. Our analysis shows that i) all clusters show a similar amount of fear in their tweets, ii) the research and news clusters display more sadness compared to others, and iii) health organizations and politicians try to gain public trust. The text analysis shows that researchers are more concerned about understanding symptoms and developing vaccination; news and politicians mainly discuss travel and hygiene; and health organizations focus on hygiene. Our descriptive analysis helps us to extract various features that we use to classify tweets among the four clusters with an AUC ROC of 96%.

Keywords: Twitter, Leaders, COVID-19, WordCloud, NLP.

Corresponding author: [email protected]

Preprint submitted to Elsevier, January 11, 2021

1. Introduction

A leader is a person who can reach and inspire a large number of people with his/her opinions. Crisis times provide an opportunity to test these leaders and assess their ability to take timely and necessary actions. Although leaders are often prepared for anticipated crisis scenarios, the true challenge arises in unforeseen circumstances. From the Arab Spring [1, 2], to hurricanes [3], to financial crises [4], leaders have used social media platforms to share their opinions and supervision. Such supervision provides support and trust to people to overcome the crisis situation. Social media platforms help these leaders to efficiently communicate their message to a large number of crisis-affected people.

OSM platforms such as Facebook, Instagram, Twitter, etc. have become an important part of individuals' lives. The nature of these platforms allows users to broadcast their opinions, discuss current issues, and express themselves about things that influence their daily life. In fact, politicians and famous public figures also use online social media (OSM) platforms as a medium for sharing their opinions and getting a general sense of how people think about them. These famous users, who have a considerable number of followers on OSM, often function as leaders of OSM platforms, as they can communicate with a large number of people effectively and influence them with their opinions.

In this work, we study Twitter, as a representative of OSM, for understanding leaders during a time of crisis, taking COVID-19 as a case study. We define leaders as users with high PageRank values. We group leaders into clusters and study them by analyzing their tweets.
This reveals the inclinations and concerns of the various clusters during COVID-19 (refer to Section 5 for details). Given the fact that the COVID-19 pandemic affected more than 180 countries [...].

https://twitter.com/
Note: In this work, "leader" represents a Twitter handle and "cluster" represents a group of Twitter handles.

In this paper, we analyze 29 million tweets from 6 million unique users, collected using Twitter's streaming API and Tweepy with COVID-19 related keywords ("coronavirus", "coronovirusoutbreak" and "COVID-19"). Our dataset spans from February 01, 2020 to May 02, 2020 and covers the following research directions:

1. Identifying leaders and their concerns: We analyze our dataset to identify the leaders during the COVID-19 crisis using social network analysis techniques. We present leaders in four leading clusters, namely research, news, health, and politics. Our findings show that leaders are mostly concerned about five major topics related to (1) symptoms; (2) vaccination; (3) hygiene; (4) travel; and (5) pandemic (Section 4).

2. Leaders' alignment towards concerns: We use text analysis techniques to understand the alignment of clusters with respect to various concerns during the COVID-19 crisis. We observe that researchers are more concerned about understanding the symptoms and developing vaccination; news about travel and hygiene; health organizations about hygiene; and politicians about travel and hygiene (Section 5).

3. Tweet classification: Based on insights from the leaders' tweets, emotions and public concerns, we build a model to estimate the probability that a tweet belongs to a specific cluster. Our findings can be used to classify tweets among clusters with an AUC ROC of 96% (Section 6).

In the past, various studies have been performed to study leadership during [...]

2. Related Work

Our work lies at the intersection of leadership, social media and epidemics. Thus, in this section, we first present literature on leadership during a crisis. Next, we present related work on the study of epidemics with respect to OSM platforms.

The majority of works have studied crisis leadership in the political domain. For example, in [11], the authors highlighted that there is very little scope for reform during crisis times, contrary to the common belief that these times are best suited for change. A finding based on Gulf leaders echoed this idea more strongly, as the leaders missed the opportunity to take stands conclusively [11]. In contrast, during crisis times individuals can even allow their democratic government to propose technocratic guidelines for the safety of society [12]. In another study based on social media posts about the Egyptian revolution, the researchers reported that leaders are no more than connective leaders [13].

Apart from political domains, leadership during crisis times has also been studied in administration settings. For example, in the aftermath of Hurricane Katrina in the US, mixed-approach studies reveal that school employees expect their leaders to create a proper plan [14] and to take quick actions in addition to engaging with them regularly [15]. In a different study, the media's portrayal of administrators dealing with the Katrina crisis was criticized [3].

Industry settings are another area where leadership plays an essential role in crisis management. In [4], the authors analyze more than 30 crises and investigate the role of the chief executive officer (CEO). The study pointed out that CEOs need to be proactive at the start of a crisis. A simulation-based approach was proposed in [16] for training leaders to handle crises effectively. For professionals working under uncertain circumstances, and especially those exposed to crisis situations, there is a strong likelihood that they may experience more negative emotions like stress and anxiety [17, 18]. Such emotional reactions often arise from the potential risk of losing one's job and from the need to have immediate and effective responses [19, 20]. To handle these situations, organizational leaders have a crucial role to play in allowing long-term preparation for the benefit of employees and the organization [21, 22].

Researchers have analysed Twitter data for studying various epidemic outbreaks such as Ebola [23, 24], H1N1 [25], flu [26], swine flu [27] etc. In [28], the authors illustrate the potential of monitoring public health using social media during the H1N1 pandemic. In another work [29], the authors track the disease level and public concern during the H1N1 pandemic in the US using Twitter data. A set of works has also focused on understanding epidemic spreading patterns at different geographic locations using social media data. In [30], the authors track the spread of influenza-A (H1N1) in India using Twitter data. Vaccination and anti-viral uptake during H1N1 in the UK are studied in [31] using tweets. In [32], the authors studied the Nipah virus in India. In a very recent work [33], researchers compared Google and Twitter data for predicting influenza cases in Greece.

To the best of our knowledge, this is the first work related to the coronavirus which has analyzed online social media such as Twitter for identifying leaders across different domains and their respective concerns.

3. Dataset Description

This section first provides information about the coronavirus (officially known as COVID-19) and Twitter, an online social media platform. Next, data collection and pre-processing are discussed in the following subsections.

In the last two decades, humanity has faced several pandemics such as SARS, H1N1, and presently the novel coronavirus (COVID-19) [34]. The severity of these pandemics can be understood from their death tolls. The COVID-19 pandemic, which started in December 2019 in Wuhan, China [35, 36], has infected 21,160,149 individuals and claimed 759,166 deaths worldwide (as of August 14, 2020) [5, 37]. That this pandemic originated in Wuhan, China, and yet as of August 14, 2020 China is not even among COVID-19's top 25 nations, shows how fast it spread to other nations around the world. It has affected individuals' lives by keeping them in isolation, quarantine, emergency, or total lockdown. During these lockdown conditions, individuals are inclined to use social media platforms to share their thoughts and experiences. By May 8, 2020, Twitter had attracted more than 628 million tweets related to coronavirus, which makes it a significant online social media platform for COVID-19 study.

Twitter is an online social media platform classified as a microblogging site with which users can share messages, links to external Web sites, images, or videos that are visible to their followers. Messages that are posted are short in contrast to traditional blogs. Blogging becomes 'micro' by shrinking it down to [...] (https://developer.twitter.com/en/docs/basics/counting-characters). Apart from Twitter, there are other microblogging platforms such as Tumblr, FourSquare (https://foursquare.com/), Google+ (http://plus.google.com), and LinkedIn (http://linkedin.com/), of which Twitter is the most popular microblogging site. In the past, research has shown that Twitter data is well suited for sentiment analysis and opinion mining [39].

The data collection is done using the Twitter Streaming Application Programming Interface (API) and Python (see Figure 1). An API is a tool that facilitates the interaction between computer programs and Web services. It enables the real-time collection of data by tracking the live stream of public tweets. Many Web services provide developers with APIs for interacting with their services and for programmatically accessing data. Python libraries such as tweepy also assist this task by providing functions that can track live streams of public tweets using hashtags or usernames.

In this work, we use the Twitter Streaming API and the Python package tweepy to download tweets matching various keywords related to the rapidly spreading coronavirus disease (COVID-19). We collect and store a large sample of public tweets beginning February 1, 2020, that matched a set of pre-defined keywords: "coronavirus" and "coronovirusoutbreak". The additional keyword "COVID-19" was added later (on February 11, 2020, COVID-19 became the official name given to the novel coronavirus disease by WHO). The objective of using these specific keywords is to collect tweets related to COVID-19. The Twitter data we collect is stored in JSON format to make it easy to parse and analyze.

Figure 1: Tweets collection, preprocessing, emotions and public concern identification flow. (Diagram labels: data collection using Twitter API and Python package tweepy; tweets preprocessing; clean text; emotion analysis; public concerns identification.)

Figure 2 shows the basic steps we took in preprocessing the tweet dataset. We start our tweet preprocessing by removing all URLs, as they do not include any useful information for text analysis. Afterward, the tweets are lowercased. Emojis, emoticons (https://gist.github.com/slowkow/7a7f61f495e3dbb7e3d767f97bd7304b), numbers, mentions (@) and hashtags (#) are removed, and slang (https://github.com/rishabhverma17/sms slang translator/blob/master/slang.txt) is converted into its real meaning to understand the real context of the tweet.

Furthermore, we remove all punctuation marks and the stopwords available in the nltk package. We also perform spell correction using the SpellChecker package in Python. Using the nltk package, we perform stemming and lemmatization of the tweet text. The dataset's size in terms of total number of tweets, original tweets, retweets, number of distinct users, and number of distinct features is shown in Table 1.

Figure 2: Tweets preprocessing. (Diagram labels: Tweets; Remove URLs; To Lower; Remove Emoticons; Remove Emojis; Remove HTML Tags; Remove Numbers; Remove Mentions, Hashtags; Convert Slang; Remove Punctuations; Remove Stopwords; Spelling Correction; Stemming, Lemmatization; Remove Top 10 Freq Words; Remove Top 10 Rare Words; Clean Text.)

Parameter Value

Time period 01-02-2020 to 02-05-2020

Table 1: Description of the dataset.
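As a rough illustration of the cleaning steps listed for Figure 2, the following minimal sketch uses only the Python standard library. It is not the pipeline used in this work: the stopword list is a tiny hypothetical stand-in for the nltk lists, hashtags and mentions are assumed to be dropped as whole tokens, and the emoji, slang, spell-correction, stemming and rare/frequent-word steps are omitted.

```python
import re
import string

# Tiny illustrative stopword list (assumption); the paper uses nltk's lists.
STOPWORDS = {"the", "a", "an", "is", "are", "to", "of", "and", "in", "for", "on"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
HTML_RE = re.compile(r"<[^>]+>")
MENTION_HASHTAG_RE = re.compile(r"[@#]\w+")  # assumption: drop the whole token
NUMBER_RE = re.compile(r"\d+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_tweet(text: str) -> str:
    """Apply a subset of the Figure 2 steps, in the order shown there."""
    text = URL_RE.sub(" ", text)              # remove URLs
    text = text.lower()                       # lowercase
    text = HTML_RE.sub(" ", text)             # remove HTML tags
    text = MENTION_HASHTAG_RE.sub(" ", text)  # remove mentions and hashtags
    text = NUMBER_RE.sub(" ", text)           # remove numbers
    text = text.translate(PUNCT_TABLE)        # remove punctuation
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    return " ".join(tokens)


if __name__ == "__main__":
    print(clean_tweet("Wash your hands for 20 seconds! @WHO #COVID19 https://t.co/abc"))
```

Keeping each step as a separate compiled pattern makes it straightforward to reorder or disable individual steps when comparing preprocessing variants.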

4. Studying Leaders During COVID-19 Crisis

To understand users' interactions and to identify leaders, we first build a network capturing users' social connections. We build the directed "retweet" network among users (Figure 3), where an edge a → b indicates that user a retweets user b. To identify users with similar alignment, we group them into communities using the Louvain modularity method; for better representation, each community is color-coded. Furthermore, to identify important nodes in the network, we employ the PageRank algorithm [40], a well-known algorithm for characterizing the centrality of nodes. PageRank reflects the importance of a node in the retweet network, and a higher PageRank value indicates an influential user who can spread tweet content to a community much faster than users with lower PageRank values. Therefore, these influential users, who are the major sources of information in the network regarding COVID-19, are termed leaders.

The top-200 leaders (shown in Figure 3 as relatively bigger nodes, indicating higher PageRank values) are categorized into four clusters, Health Organizations, Politics, News Organizations and Research Organizations, depending upon their publicly available information such as profile description. Some examples of users in the four clusters are as follows: (i) health organizations: Twitter handles related to health organizations, such as World Health Organization (@WHO), Centers for Disease Control and Prevention (@CDCgov), Director General WHO (@DrTedros), Global Health Strategies (@GHS) and World Health Organization Western Pacific (@WHOWPRO); (ii) politics: Twitter profiles such as U.S. President (@realDonaldTrump) and U.S. House Candidate; (iii) news organizations: Twitter handles such as China Global Television Network (@CGTNOfficial), Al Jazeera English (@AJEnglish) and Global Times (@globaltimesnews); and (iv) research organizations: such as ISCR. This indicates that, on Twitter, users are (re)sharing information regarding COVID-19 from more reliable sources.

Figure 3 shows that these leaders also interact with each other. CDC is at the center of the network, and other leaders such as WHO, CGTN, GHS, the Director-General of WHO and U.S. House candidates are well connected with CDC. We also notice that the U.S. President is closely connected with the Director-General of WHO. News handles other than @CGTNOfficial are tightly connected with only a few leaders. For example, Al Jazeera English (@AJEnglish) closely follows the tweets from WHO and WHO Western Pacific. We further analyze the dominant emotions for each cluster as well as their alignment towards various public concerns during COVID-19.
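To make the leader-identification step concrete, the sketch below computes PageRank by power iteration on a toy directed retweet network in which an edge (a, b) means that user a retweets user b. It is an illustrative sketch, not the implementation used in this work: the damping factor 0.85, the uniform handling of nodes without outgoing edges, and the user names alice, bob, carol and dave are all assumptions.

```python
from collections import defaultdict


def pagerank(edges, damping=0.85, iterations=100, tol=1e-10):
    """Power-iteration PageRank on a directed graph given as (source, target) pairs."""
    nodes = sorted({user for edge in edges for user in edge})
    out_links = defaultdict(set)
    for src, dst in edges:
        out_links[src].add(dst)

    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no outgoing edges is spread uniformly.
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for src in nodes:
            targets = out_links[src]
            if targets:
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    new_rank[dst] += share
        delta = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


# Toy retweet network: (a, b) means "a retweets b"; all user names are hypothetical.
retweets = [
    ("alice", "WHO"), ("bob", "WHO"), ("carol", "WHO"),
    ("carol", "CDCgov"), ("dave", "CDCgov"),
]

if __name__ == "__main__":
    scores = pagerank(retweets)
    for user in sorted(scores, key=scores.get, reverse=True):
        print(user, round(scores[user], 4))
```

Because rank flows along retweet edges towards the retweeted account, handles that are retweeted by many users end up with the highest scores, which is the sense in which high-PageRank users are treated as leaders here.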

Dominant emotions for various clusters: We analyze tweets to discover each cluster's dominant emotions. We use the Syuzhet package [41] (https://github.com/mjockers/syuzhet), available in R, for emotion extraction. The Syuzhet package is designed to work at the sentence level, and repeated words do not count towards the emotion assignment. This means that annotation is performed at the sentence level and not at the paragraph level.

We observe that Anticipation, Fear, Sadness and Trust are dominant emotions for all clusters compared to Anger, Disgust, Joy and Surprise (see Figure 4). [...] fear and anticipation in their tweets. The political and health organization clusters indicate higher trust compared to others. On the other hand, the research and news clusters display greater sadness towards the COVID-19 situation. Therefore, we can infer that leaders are displaying fear, anticipation and sadness, but at the same time they are trying to build an environment of trust among the public.

Figure 3: COVID-19 Related Users Network. Color-coded groups represent communities. To identify important users (or major sources of information), the PageRank algorithm is used. (Node labels: CDC, Director-General WHO, GHS, U.S. President, WHO, U.S. House Candidate, Al Jazeera English, Global Times, WHO Western Pacific, CGTN, ISCR.)

Figure 4: Dominant emotions in tweets from various leaders.
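The sentence-level counting described above, in which a word repeated within a sentence contributes only once, can be illustrated with a small lexicon lookup. The sketch below is not the Syuzhet package and does not reproduce its behavior in detail; it is a hypothetical Python illustration, and the five-word lexicon is invented purely for the example.

```python
import re
from collections import Counter

# Hypothetical toy lexicon mapping words to emotions (assumption, for illustration).
TOY_LEXICON = {
    "outbreak": {"fear", "sadness"},
    "death": {"fear", "sadness"},
    "vaccine": {"anticipation", "trust"},
    "protect": {"trust"},
    "hope": {"anticipation", "joy", "trust"},
}


def emotion_counts(text, lexicon=TOY_LEXICON):
    """Count emotions per sentence, counting each distinct word once per sentence."""
    counts = Counter()
    for sentence in re.split(r"[.!?]+", text.lower()):
        # set() ensures a repeated word is counted once within a sentence.
        for word in set(re.findall(r"[a-z]+", sentence)):
            counts.update(lexicon.get(word, ()))
    return counts


if __name__ == "__main__":
    print(emotion_counts("Death, death everywhere. We hope a vaccine will protect us!"))
```

Summing such counts over all tweets of a cluster and normalizing by the total would give per-cluster emotion shares comparable in spirit to those plotted in Figure 4.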

In addition, we perform text analysis to understand the focus of leaders' interest during COVID-19. We use topic modeling to extract these interests. In particular, we use the Latent Dirichlet Allocation (LDA) [42] method with Gibbs sampling [43], an unsupervised, probabilistic machine-learning topic modeling method that extracts topics from text data. The key assumption behind LDA is that each given text is a mix of multiple topics. We extract ten topics from the tweets' text. LDA returns a set of ten words related to each identified topic (but not the title of the topic). We then assign to each set of words an appropriate title that closely reflects the topic at an abstract level. We observe that leaders mostly discuss five topics related to various public concerns during COVID-19. These concerns are: (1) disease symptoms, (2) disease vaccination, (3) disease countermeasures or hygiene, (4) disease transmission during travel and (5) COVID-19 as pandemic/epidemic. In the next section, we analyze the clusters' alignment toward these public concerns.

5. Clusters Alignment Towards Various Concerns

To study the alignment of the various clusters towards the five most discussed topics or concerns discovered in Section 4 regarding the rapidly spreading coronavirus disease (COVID-19), we annotate each tweet using specific keywords: disease symptoms (keyword symptom); vaccination (keywords vaccine and vaccination); disease countermeasures (keywords hygiene, wash, hand and mask); disease transmission during travel (keywords travel, flying, fly, airplane, flight and trip); and pandemic (keywords pandemic and epidemic). In this section, we start by uncovering the specifics of these public concerns to understand them in more detail. Next, we explain the alignment of the leading clusters towards the above-mentioned public concerns.

Analysis of tweets belonging to the symptoms category shows that the daily percentage of observed COVID-19 tweets related to symptoms is higher than for other concerns, which reflects that users on Twitter are more concerned about the symptoms of the disease. Furthermore, we extract the symptoms of coronavirus from tweets (see Figure 5a). The identified symptoms are fever, cough and cold. Leaders are also discussing the similarity of COVID-19 and influenza/flu, as both spread among people in a similar way, that is, via respiratory droplets from coughing or sneezing [44]. Leaders are calling COVID-19 an asymptomatic disease, as the time between exposure and symptom onset is typically five days, but may range from two to fourteen days, with mild fever among healthy people [44]. In particular, we also observe that the research cluster is more concerned about understanding the COVID-19 symptoms compared to other clusters (see Figure 6).

Figure 5: Wordclouds: (a) Symptoms, (b) Vaccine, (c) Countermeasures, (d) Travel.

Figure 6: Leaders alignment towards various public concerns.

The analysis of tweets belonging to the vaccination category indicates that, as there is currently no vaccine and no specific treatment available for COVID-19, leaders are discussing the effectiveness of flu vaccination for COVID-19, as both flu and COVID-19 cause respiratory disease (see Figure 5b). We can also infer that leaders are intensely discussing the ongoing research to cure COVID-19. Therefore, leaders on Twitter are very conscious of the symptoms of COVID-19 and are also keeping an eye on the ongoing vaccination research for COVID-19. As no vaccine is presently available, all clusters are tweeting cautiously regarding vaccination (see Figure 6).
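The keyword-based annotation introduced at the start of this section can be sketched as a simple lookup. The keyword sets below are the ones listed in the text; the exact-token matching rule and the function name annotate_concerns are assumptions, since the text does not state whether matching is done on raw, stemmed or lemmatized tokens.

```python
import re

# Keyword sets as listed in the text; the dictionary keys are illustrative labels.
CONCERN_KEYWORDS = {
    "symptoms": {"symptom"},
    "vaccination": {"vaccine", "vaccination"},
    "countermeasures": {"hygiene", "wash", "hand", "mask"},
    "travel": {"travel", "flying", "fly", "airplane", "flight", "trip"},
    "pandemic": {"pandemic", "epidemic"},
}


def annotate_concerns(clean_text):
    """Return the concerns whose keywords appear as whole tokens in the text.

    Assumes the text was already cleaned (for example, lemmatized), so that
    'symptoms' has been reduced to 'symptom' before matching.
    """
    tokens = set(re.findall(r"[a-z]+", clean_text.lower()))
    return [concern for concern, keywords in CONCERN_KEYWORDS.items()
            if tokens & keywords]


if __name__ == "__main__":
    print(annotate_concerns("wash hand and wear mask before flight"))
```

A tweet can receive several labels under this scheme, so the per-cluster percentages for different concerns need not sum to 100%.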

We identify the tweets belonging to the countermeasures category using keywords such as hygiene, wash, hand and mask. Furthermore, we partition countermeasures category tweets into Hygiene (keywords hygiene, wash, hand) and Mask (keyword mask). Note that the volume of tweets pertaining to Mask is higher than Hygiene before February 24, and after that leaders started focusing more on Hygiene compared to Mask. This indicates that initially leaders were tweeting about wearing masks, but later they shifted their countermeasures strategy towards proper Hygiene against COVID-19.

To explore the countermeasures against COVID-19, we created a wordcloud from countermeasures category tweets (see Figure 5c), indicating prevention suggestions including handwashing, respiratory hygiene, self-isolation and self-quarantine. Hand washing using soap and water or sanitizer is highly discussed as a countermeasure. Handwashing is also recommended by the US Centers for Disease Control and Prevention (CDC) to prevent the spread of the disease. The CDC recommends that people wash hands often with soap and water for at least 20 seconds, especially after going to the toilet or when hands are visibly dirty; before eating; and after blowing one's nose, coughing, or sneezing. It further recommends using an alcohol-based hand sanitizer with at least 60% alcohol by volume when soap and water are not readily available. The CDC's advice to avoid touching the eyes, nose, or mouth with unwashed hands and, for respiratory hygiene, to cover the mouth with masks and take precautions in case of coughing is also highly tweeted.

The tweeting percentage regarding hygiene is higher for health organizations and politicians than for the news and research clusters (see Figure 6). We can conclude that leaders on Twitter are spreading awareness of the proper hygiene measures against COVID-19 issued by WHO and CDC.

We focus on traveling via flight to understand the effect of COVID-19 on travel. We identify the tweets belonging to the travel category using keywords such as travel, flying, fly, airplane, flight and trip. To explore the effect of COVID-19 on traveling, we create a wordcloud from travel category tweets. Among clusters, the news and political clusters focus more on travelling compared to researchers and health organizations (see Figure 6).

The COVID-19 outbreak was first identified in Wuhan, China, in December 2019 [35]. The World Health Organization (WHO) declared the outbreak a Public Health Emergency of International Concern on January 30th [...] researchers, news and health clusters are frequently using pandemic or epidemic while discussing COVID-19 compared to politicians (see Figure 6).

To summarize, on analyzing the clusters' alignment towards various public concerns (see Figure 6), we find that researchers are highly concerned about understanding COVID-19 by studying its symptoms and the development of vaccination. News handles are discussing travel and hygiene. Health organizations are focusing on hygiene, whereas politicians are highly concerned about travel and hygiene. This indicates that the different clusters are focused on specific public concerns. This can be viewed as a positive approach, since various clusters are focusing on specific issues as well as engaging with each other on common concerns.

6. Tweet Classification In Clusters

Next, taking insights from previous sections, we build a model to estimate the likelihood that a tweet belongs to a specific cluster. The tweet percentages and counts belonging to the different clusters are as follows: News (36%; 15,289), Health (33%; 14,023), Research (18%; 7,635) and Politics (13%; 5,521). As the tweet distribution among clusters is imbalanced, this problem is represented as an imbalanced multi-label classification [46].

Figure 7: Flow diagram for features concatenation and model selection. (Diagram labels: tweet clean text; emotions; public concerns; embedding; concatenate; multi-label classification (SVC, RF, NN or BERT); output.)

To illustrate the predictive power of the various feature sets mentioned in the earlier sections, we define a series of models, each with a different feature set. We focus on three different types of features:

1. Tweet text: This refers to the clean text extracted after preprocessing the original tweets (see Section 3.4).

2. Emotions: This refers to the emotions associated with each tweet (see Section 4).

3. Public concerns: This corresponds to the public concerns revealed in Section 5.

6.2. Experimental setup and results

We aim to estimate the likelihood that a tweet belongs to a specific cluster by using the features mentioned in Section 6.1. As in Section 4, we filter the leaders and cluster them into four groups. We remove all tweets from users that are not in any of the leader clusters. We also delete all tweets that are blank after preprocessing. After applying these filters, we obtain a dataset containing 42,468 tweets.

Since the dataset is imbalanced, we use the Synthetic Minority Over-sampling Technique (SMOTE) [47] to resolve this problem. SMOTE works by selecting examples that are close in the feature space, drawing a line between the examples in the feature space and drawing a new sample at a point along that line. Specifically, a random example from the minority class is first chosen. Then k of the nearest neighbors of that example are found (typically k=5). One of these neighbors is selected at random, and a synthetic example is created at a randomly selected point between the two examples in feature space.

We experiment with several classification models, including Support Vector Classifier (SVC) [48, 49], Random Forests (RF) [50], Random neural network (NN, see Figure 8 for the framework) [51, 52] and Bidirectional Encoder Representations from Transformers (BERT) [53]. Figure 7 displays the data flow to the models. We find that the Random Forest model is the most effective for our task. Because the dataset is imbalanced and classification involves a trade-off between true and false positive rates, we choose to compare models using the area under the receiver operating characteristic (ROC) curve (AUC) [54, 55]. Under this metric, a random baseline scores 50% AUC ROC. We use 5-fold cross-validation for estimation and run all models three times, giving 15 iterations in total for each model. We also standardize the emotion and public concern features to have zero mean and unit variance.
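The SMOTE interpolation step described above can be sketched in a few lines. This is a minimal illustration of generating a single synthetic point, not the implementation used in the experiments; the function name smote_sample, the use of Euclidean distance and the toy two-dimensional minority points are assumptions.

```python
import math
import random


def smote_sample(minority, k=5, rng=None):
    """Create one synthetic minority example by interpolating towards a neighbor.

    minority: list of equal-length numeric tuples (at least two points).
    k: number of nearest neighbors to choose from.
    """
    rng = rng or random.Random()
    base = rng.choice(minority)
    # k nearest neighbors of the chosen example (Euclidean distance, excluding itself).
    neighbors = sorted(
        (p for p in minority if p is not base),
        key=lambda p: math.dist(base, p),
    )[:k]
    neighbor = rng.choice(neighbors)
    gap = rng.random()  # position along the segment from base to neighbor
    return tuple(b + gap * (n - b) for b, n in zip(base, neighbor))


# Toy minority class: the four corners of the unit square (hypothetical data).
corners = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        print(smote_sample(corners, k=2, rng=rng))
```

With k=2 on this toy data, each corner's nearest neighbors are its two adjacent corners, so every synthetic point lies on an edge of the square. Over-sampling a minority class amounts to calling such a routine repeatedly until the class sizes are balanced, and it should be applied to training folds only so that synthetic points do not leak into evaluation data.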

Results: Table 2 shows the classification performance of our models. With the RF model trained on all available features, we achieve a mean AUC ROC of 96% with a standard deviation of 0.2%. The low standard deviation indicates that the model is robust. All models show that the most important feature for the classification is the clean text. We note that our best model, that is, RF, is trained with the TF-IDF embedding of the words a user used in their tweets. We also see that public concerns and emotions are important characteristics for classifying a tweet into the different clusters.

Figure 8: Random neural network framework. (Diagram labels: Input; Conv_1 convolution (dropout=0.2); max-pooling (2 x 2); fully-connected LSTM layer; fully-connected neural network; softmax activation; output; Politics, Health, Research, News.)
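For readers unfamiliar with the TF-IDF embedding mentioned above, the sketch below computes the textbook weighting tf(t, d) * log(N / df(t)) for whitespace-tokenized texts. It is only an illustration: the exact TF-IDF variant used in this work is not specified, and common libraries such as scikit-learn apply smoothed and normalized variants by default. The example documents are hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dictionary per document.

    Uses tf = term count / document length and idf = log(N / document frequency).
    """
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


if __name__ == "__main__":
    docs = ["covid wash hand", "covid travel ban", "covid wash mask"]
    for vec in tfidf(docs):
        print(vec)
```

In this toy corpus, a term that occurs in every document (here "covid") receives zero weight, while terms that are specific to a few documents are weighted up, which is why such an embedding can help separate clusters with different vocabularies.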

7. Conclusion

In a quest to understand the role of various leaders during COVID-19, we study a large number of tweets using techniques such as network analysis, text analysis and sentiment analysis. On the basis of network analysis, users are categorized into four different clusters, namely research, news, health and politics. The results of the text analysis show that leaders of different clusters are focusing on various public concerns: researchers on understanding the pandemic's symptoms and developing vaccination; news on travel and hygiene; health organizations on hygiene; and politicians on travel and hygiene. Sentiment analysis indicates that emotions such as anticipation, fear, sadness and trust are dominant for all clusters. Finally, we show that the features presented in this work can be used to classify tweets into the various clusters with an AUC ROC of up to 96%. Our findings suggest that tweets themselves carry some features which can be used to identify a user's profession.

Table 2: Model mean AUC ROC and standard deviation for various dataset features. For example, the BERT model with the dataset (Text+Emotion) achieves 93.6% mean AUC ROC with 0.8% standard deviation.

Note that the Twitter stream is filtered following Twitter's API documentation; hence the tweets analyzed here constitute a representative subset of the stream as opposed to the entire stream. In the future, we intend to analyze the tweet data to a greater extent for different categories of users to understand their patterns. Future work should also explore more sophisticated models and analyze in greater depth how the writing style of the various clusters can be exploited to capture their signature in their tweets.

Acknowledgment

This research was funded by ERDF via the IT Academy Research Programme and SoBigData++.