Studying Leaders & Their Concerns Using Online Social Media During The Times Of Crisis -- A COVID Case Study
Studying Leaders During Times of Crisis Using Online Social Media - A COVID Case Study
Rahul Goel a,1, Rajesh Sharma a
a Institute of Computer Science, University of Tartu, Estonia
Abstract
Online social media (OSM) has become a primary platform for discussion on diverse topics. Even famous and public figures often express their views on various topics through OSM platforms. The novel coronavirus, officially called COVID-19, which has become a pandemic and has created a crisis in human history, is one such topic that has attracted a lot of attention on Twitter in recent times. In this work, we analyze 29 million tweets spanning three months to study highly influential users, whom we call leaders. We identify these leaders using social network analysis and analyze their tweets using text analysis techniques. We group these leaders into four clusters, namely research, news, health and politics. Our analysis shows that i) all clusters show a similar amount of fear in their tweets, ii) the research and news clusters display more sadness compared to others, and iii) health organizations and politicians try to gain public trust. The text analysis shows that researchers are more concerned about understanding symptoms and developing vaccination; news and politicians mainly discuss travel and hygiene; and health organizations focus on hygiene. Our descriptive analysis helps us to extract various features that we use to classify tweets among the four clusters with an accuracy of 96% AUC ROC.
Keywords:
Twitter, Leaders, COVID-19, WordCloud, NLP.
Corresponding author: [email protected]
Preprint submitted to Elsevier January 11, 2021

1. Introduction

A leader is a person who can reach and inspire a large number of people with his/her opinions. Crisis times provide an opportunity to test these leaders and assess their ability to take timely and necessary actions. Although leaders are often prepared for an anticipated crisis scenario, the true challenge arises in unforeseen circumstances. From the Arab Spring [1, 2], to hurricanes [3], to financial crises [4], leaders have used social media platforms to share their opinions and supervision. Such supervision provides support and trust to people to overcome the crisis situation. Social media platforms help these leaders to efficiently communicate their message to a large number of crisis-affected people.

OSM platforms such as Facebook, Instagram, Twitter, etc. have become an important part of individuals' lives. The nature of these platforms allows users to broadcast their opinions, discuss current issues, and express themselves about things that influence their daily life. In fact, politicians and famous public figures also use online social media (OSM) platforms as a medium for sharing their opinions and getting a general sense of how people think about them. These famous users, who have a considerable number of followers on OSM, often function as leaders of OSM platforms, as they can communicate with a large number of people effectively and influence them with their opinions.

In this work, we study Twitter, as a representative of OSM, to understand leaders during a time of crisis, taking COVID-19 as a case study. We define leaders as users with high PageRank values. We group leaders into clusters and study them by analyzing their tweets. This reveals the inclination and concerns of the various clusters during COVID-19 (refer to Section 5 for details). Given the fact that the COVID-19 pandemic affected more than 180 countries [...]

https://twitter.com/
Note: In this work, “leader” represents a Twitter handle and “cluster” represents a group of Twitter handles.

In this paper, we analyze 29 million tweets from 6 million unique users, collected using Twitter’s streaming API and Tweepy with COVID-19 related keywords (“coronavirus”, “coronovirusoutbreak” and “COVID-19”). Our dataset spans from February 01, 2020 to May 02, 2020 and covers the following research directions:

1. Identifying leaders and their concerns: We analyze our dataset to identify the leaders during the COVID-19 crisis using social network analysis techniques. We present leaders in four leading clusters, namely research, news, health, and politics. Our findings show that leaders are mostly concerned about five major topics, related to (1) symptoms; (2) vaccination; (3) hygiene; (4) travel; and (5) pandemic (Section 4).

2. Leaders alignment towards concerns: We use text analysis techniques to understand the alignment of clusters with respect to various concerns during the COVID-19 crisis. We observe that researchers are more concerned about understanding the symptoms and developing vaccination; news about travel and hygiene; health organizations about hygiene; and politicians about travel and hygiene (Section 5).

3. Tweet classification: Building on the insights from the leaders' tweets, emotions and public concerns, we build a model to estimate the probability that a tweet belongs to a specific cluster. Our findings can be used to classify tweets among clusters with an accuracy of 96% AUC ROC (Section 6).

In the past, various studies have been performed to study leadership during [...]
2. Related Work
Our work lies at the intersection of leadership, social media and epidemics. Thus, in this section, we first present literature concerning works which have studied leadership during a crisis. Next, we present related works about epidemic studies with respect to OSM platforms.
The majority of works have studied the leadership crisis in the political domain. For example, in [11], the authors highlighted that there is very little scope for reform during crisis times, against the common belief that these times are best suited for change. A finding based on Gulf leaders echoed this idea more strongly, as the leaders missed the opportunity to take stands conclusively [11]. In contrast, during crisis times individuals can even allow their democratic government to propose technocratic guidelines for the safety of the society [12]. In another study based on social media posts about the Egyptian revolution, the researchers reported that leaders are no more than connective leaders [13].

Apart from political domains, leadership during crisis times has also been studied in administration settings. For example, in the aftermath of Hurricane Katrina in the US, mixed-approach studies reveal that school employees expect their leaders to create a proper plan [14], and to take quick actions in addition to regular engagement with them [15]. In a different study, the portrayal by the media of administrators dealing with the Katrina crisis was criticized [3].

Industry settings are another area where leadership plays an essential role in crisis management. In [4], the authors analyzed more than 30 crises and investigated the role of the chief executive officer (CEO). The study pointed out that CEOs need to be proactive at the start of the crisis. A simulation-based approach was proposed in [16] for training leaders to handle crises effectively. For professionals working under uncertain circumstances, and especially those exposed to crisis situations, there is a strong likelihood that they may experience more negative emotions like stress and anxiety [17, 18]. Such emotional reactions often arise from the potential risk of losing one’s job and from the need to have immediate and effective responses [19, 20]. To handle these situations, organizational leaders have a crucial role to play in allowing long-term preparation for the benefit of employees and the organization [21, 22].
Researchers have analysed Twitter data for studying various epidemic outbreaks such as Ebola [23, 24], H1N1 [25], flu [26], swine flu [27], etc. In [28], the authors illustrate the potential of monitoring public health using social media during the H1N1 pandemic. In another work [29], the authors track the disease level and public concern during the H1N1 pandemic in the US using Twitter data. A set of works has also focused on understanding epidemic spreading patterns at different geographic locations using social media data. In [30], the authors track the spread of influenza-A (H1N1) in India using Twitter data. Vaccination and anti-viral uptake during H1N1 in the UK are studied in [31] using tweets. In [32], the authors studied the Nipah virus in India. In a very recent work [33], researchers compared Google and Twitter data for predicting influenza cases in Greece.

To the best of our knowledge, this is the first work related to the coronavirus which has analyzed online social media such as Twitter for identifying leaders across different domains and their respective concerns.
3. Dataset Description
This section first provides information about the coronavirus (officially known as COVID-19) and Twitter, an online social media platform. Next, the data collection and pre-processing are discussed in the following subsections.
In the last decade, humanity has faced many different pandemics such as SARS, H1N1, and presently the novel coronavirus (COVID-19) [34]. The severity of these pandemics can be understood from the death toll claimed by them. The COVID-19 pandemic, which started in December 2019 in Wuhan, China [35, 36], has infected 21,160,149 individuals and claimed 759,166 deaths worldwide (as of August 14th, 2020) [5, 37]. The fact that this pandemic originated in Wuhan, China, and that as of August 14, 2020 China is not even among COVID-19’s top 25 nations, shows how fast the pandemic spread to other nations around the world. This impacted individuals' lives by keeping them in isolation, quarantine, emergency, or total lockdown. During these lockdown conditions, individuals are inclined to use social media platforms to share their thoughts and experiences. By May 8, 2020, Twitter had attracted more than 628 million tweets related to the coronavirus, which makes it a significant online social media platform for COVID-19 study.

Twitter is an online social media platform classified as a microblogging site, with which users can share messages, links to external Web sites, images, or videos that are visible to their followers. Messages that are posted are short in contrast to traditional blogs. Blogging becomes ‘micro’ by shrinking it down to [...]. Apart from Twitter, there are other microblogging platforms such as Tumblr, FourSquare, Google+, and LinkedIn; among these, Twitter is the most popular microblogging site. In the past, research has shown that Twitter data is well suited for sentiment analysis and opinion mining [39].

The data collection is done using the Twitter Streaming Application Programming Interface (API) and Python (see Figure 1). An API is a tool that facilitates the interaction between computer programs and Web services. It enables the real-time collection of data by tracking the live stream of public tweets.
Many Web services provide developers with APIs for interacting with their services, and for programmatically accessing data. Python libraries such as tweepy also assist this task by providing functions that can track live streams of public tweets using hashtags or usernames.

In this work, we use the Twitter Streaming API and the Python package tweepy to download tweets related to various keywords regarding the rapidly spreading coronavirus disease (COVID-19). We collect and store a large sample of public tweets, beginning February 1, 2020, that matched a set of pre-defined keywords: “coronavirus” and “coronovirusoutbreak”. An additional keyword, “COVID-19”, was added later (on February 11, 2020, COVID-19 was the official name given to the novel coronavirus by WHO). The objective of using these specific keywords is to collect tweets related to COVID-19. The Twitter data we collect is stored in JSON format to make it easy to parse and analyze.

[Figure 1 flow: Data collection using Twitter API and Python package tweepy -> Tweets preprocessing -> Clean text -> Emotion Analysis, Public concerns Identification]
Figure 1: Tweets collection, preprocessing, emotions and public concern identification flow.

https://developer.twitter.com/en/docs/basics/counting-characters
https://foursquare.com/
http://plus.google.com
http://linkedin.com/
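As an illustration of working with the stored JSON, the following is a minimal sketch, not the authors' collection code. It assumes one JSON object per line with `id` and `text` fields (a hypothetical layout), and it re-applies the three keywords named above as a case-insensitive substring match. The helper names `matches_keywords` and `load_matching_tweets` are invented for this example.

```python
import json

# Keywords named in the text; matching is a case-insensitive substring search.
KEYWORDS = ("coronavirus", "coronovirusoutbreak", "covid-19")

def matches_keywords(text, keywords=KEYWORDS):
    """Return True if the tweet text mentions any tracked keyword."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in keywords)

def load_matching_tweets(json_lines):
    """Parse one JSON object per line and keep tweets that match the keywords.

    Assumes each object carries 'id' and 'text' fields (hypothetical layout).
    """
    kept = []
    for line in json_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        tweet = json.loads(line)
        if matches_keywords(tweet.get("text", "")):
            kept.append(tweet)
    return kept

sample = [
    '{"id": 1, "text": "New coronavirus cases reported today"}',
    '{"id": 2, "text": "Good morning everyone"}',
    '{"id": 3, "text": "WHO names the disease COVID-19"}',
]
matching = load_matching_tweets(sample)
print([tweet["id"] for tweet in matching])
```

In practice the list of lines would come from iterating over an open file; the in-memory sample only keeps the sketch self-contained.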
Figure 2 shows the basic steps we took in preprocessing the tweet dataset. We start our tweet preprocessing by removing all the URLs, as they do not include any useful information for text analysis. Afterward, the tweets are subjected to lowercasing. The emojis, emoticons, numbers, mentions (@) and hashtags (#) are removed, and slang is converted into its real meaning to understand the real context of the tweet.

Furthermore, we proceed with removing all punctuation marks and the variety of stopwords available in the nltk package. We also perform spelling correction using the SpellChecker package in Python. Using the nltk package, we perform stemming and lemmatization of the tweet text.

The dataset’s size in terms of the total number of tweets, original tweets, retweets, number of distinct users, and the number of distinct features is shown in Table 1.

https://gist.github.com/slowkow/7a7f61f495e3dbb7e3d767f97bd7304b
https://github.com/rishabhverma17/sms slang translator/blob/master/slang.txt

[Figure 2 steps: Tweets -> Remove URLs -> To Lower -> Remove Emoticons -> Remove Emojis -> Remove HTML Tags -> Remove Numbers -> Remove Mentions, Hashtags -> Convert Slang -> Remove Punctuations -> Remove Stopwords -> Spelling Correction -> Stemming, Lemmatization -> Remove Top 10 Freq Words -> Remove Top 10 Rare Words -> Clean Text]
Figure 2: Tweets preprocessing.

Parameter      Value
Time period    01-02-2020 to 02-05-2020
Table 1: Description of the dataset.
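A few of the preprocessing steps above can be sketched with the Python standard library alone. This is a simplified illustration under stated assumptions, not the authors' pipeline: the stopword set is a tiny hypothetical stand-in for the nltk lists, and emoji removal, slang conversion, spelling correction, stemming and lemmatization are left out.

```python
import re
import string

# Tiny illustrative stopword set; the paper uses the nltk stopword lists instead.
STOPWORDS = {"the", "a", "an", "is", "are", "to", "of", "and", "in", "for", "your"}

def clean_tweet(text):
    """Apply a subset of the preprocessing steps to one tweet."""
    text = re.sub(r"https?://\S+", " ", text)   # remove URLs
    text = text.lower()                         # lowercase
    text = re.sub(r"<[^>]+>", " ", text)        # remove HTML tags
    text = re.sub(r"[@#]\w+", " ", text)        # remove mentions and hashtags
    text = re.sub(r"\d+", " ", text)            # remove numbers
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = [token for token in text.split() if token not in STOPWORDS]
    return " ".join(tokens)

example = "Wash your hands for 20 seconds! @WHO #COVID19 https://example.org/tips"
print(clean_tweet(example))  # prints "wash hands seconds"
```

URLs are stripped first so that the later punctuation removal does not leave fragments of them behind as spurious tokens.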
4. Studying Leaders During COVID-19 Crisis
To understand the users’ interactions and to identify leaders, we first build a network by capturing the users’ social connections. We build the directed “retweet” network among users (Figure 3), where an edge a −→ b indicates that user a retweets user b. To identify users with similar alignment, we group them into communities using the Louvain modularity method, and for better representation each community is color-coded. Furthermore, to identify important nodes in the network, we employ the PageRank algorithm [40], a well-known algorithm to characterize the centrality of nodes. PageRank reflects the importance of a node in the retweet network, and a higher PageRank value represents influential users who can spread their tweet content to a community much faster compared to users with a lower PageRank value. Therefore, these influential users, who are the major sources of information in the network regarding COVID-19, are termed leaders.

The top-200 leaders are categorized (in Figure 3, leaders are shown using relatively bigger nodes, indicating higher PageRank values) into four clusters, Health Organizations, Politics, News Organizations and Research Organizations, depending upon their publicly available information such as profile description. Some examples of users in the four clusters are as follows: (i) health organizations: includes health-organization-related Twitter handles such as World Health Organization (@WHO), Centers for Disease Control and Prevention (@CDCgov), Director General WHO (@DrTedros), Global Health Strategies (@GHS) and World Health Organization Western Pacific (@WHOWPRO); (ii) politics: includes Twitter profiles related to the U.S. President (@realDonaldTrump) and a U.S. House Candidate; (iii) news organizations: examples of Twitter handles include China Global Television Network (@CGTNOfficial), Al Jazeera English (@AJEnglish) and Global Times (@globaltimesnews); and (iv) research organizations: such as ISCR. This indicates that, on Twitter, users are (re)sharing information regarding COVID-19 from more reliable sources.

Figure 3 shows that these leaders also interact with each other. CDC is at the center of the network, and all other leaders such as WHO, CGTN, GHS, the Director-General of WHO and U.S. House candidates are well connected with CDC. We also noticed that the U.S. President is closely connected with the Director-General of WHO. News handles other than @CGTNOfficial are tightly connected with only a few leaders. For example, Al Jazeera English (@AJEnglish) closely follows the tweets from WHO and WHO Western Pacific. We further analyze the dominant emotions for each cluster as well as their alignment towards various public concerns during COVID-19.
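To make the ranking step concrete, here is a minimal sketch of PageRank by power iteration on a toy retweet graph. The edge list, the damping factor of 0.85 and the iteration count are assumptions chosen for illustration; the paper does not state its parameters, and the Louvain community step is not shown.

```python
def pagerank(edges, damping=0.85, iterations=100):
    """Power-iteration PageRank on a directed graph given as (source, target) pairs."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for source, target in edges:
        out_links[source].append(target)
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread uniformly.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        base = (1.0 - damping + damping * dangling) / len(nodes)
        new_rank = {node: base for node in nodes}
        for source in nodes:
            for target in out_links[source]:
                new_rank[target] += damping * rank[source] / len(out_links[source])
        rank = new_rank
    return rank

# Edge (a, b) means user a retweeted user b, so rank flows toward b.
retweets = [
    ("u1", "who"), ("u2", "who"), ("u3", "who"),
    ("u1", "cdc"), ("u3", "cdc"), ("who", "cdc"),
]
scores = pagerank(retweets)
leaders = sorted(scores, key=scores.get, reverse=True)[:2]
print(leaders)
```

Because edges point from the retweeting user to the retweeted user, accounts that are retweeted often, or retweeted by other highly ranked accounts, accumulate the highest scores, which matches the notion of a leader used here.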
[Figure 3 node labels: CDC, Director-General WHO, GHS, U.S. President, WHO, U.S. House Candidate, Al Jazeera English, Global Times, WHO Western Pacific, CGTN, ISCR]
Figure 3: COVID-19 Related Users Network. Color-coded groups represent communities. To identify important users (or major sources of information), the PageRank algorithm is used.

Dominant emotions for various clusters: We analyze tweets to discover each cluster's dominant emotions. We use the Syuzhet package [41], available in R, for emotion extraction. The Syuzhet package is designed to work at the sentence level, and repeated words do not count towards the emotion assignment. This means that annotation is performed at the sentence level and not at the paragraph level.

We observe that Anticipation, Fear, Sadness and Trust are dominant emotions for all clusters compared to Anger, Disgust, Joy and Surprise (see Figure 4). [...] fear and anticipation in their tweets. Political and health organization clusters indicate higher trust compared to others. On the other hand, the research and news clusters display greater sadness towards the COVID-19 situation. Therefore, we can infer that leaders are displaying fear, anticipation and sadness but, at the same time, are trying to build an environment of trust among the public.

https://github.com/mjockers/syuzhet

Figure 4: Dominant emotions in tweets from various leaders.
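The sentence-level counting rule described above can be sketched in Python with a toy lexicon. This is only an illustration of the idea that repeated words within a sentence count once; the four-word lexicon is hypothetical, and the sketch does not reproduce the Syuzhet package or its lexicon.

```python
import re
from collections import Counter

# Hypothetical toy lexicon mapping words to emotions, for illustration only.
LEXICON = {
    "outbreak": {"fear", "sadness"},
    "death": {"fear", "sadness"},
    "vaccine": {"anticipation", "trust"},
    "protect": {"trust"},
}

def emotion_counts(text):
    """Count lexicon emotions sentence by sentence."""
    counts = Counter()
    for sentence in re.split(r"[.!?]+", text.lower()):
        # A set drops repeated words, so each word counts once per sentence.
        for word in set(re.findall(r"[a-z]+", sentence)):
            counts.update(LEXICON.get(word, ()))
    return counts

counts = emotion_counts("Outbreak, outbreak everywhere. A vaccine will protect us.")
print(dict(counts))
```

Summing such counts over all tweets of a cluster and normalizing by the total would give per-cluster emotion shares comparable in spirit to Figure 4.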
In addition, we perform text analysis to understand the focus of interest of leaders during COVID-19. We use topic modeling to extract these interests. In particular, we use the Latent Dirichlet Allocation (LDA) [42] method with Gibbs sampling [43], which is an unsupervised and probabilistic machine-learning topic modeling method that extracts topics from text data. The key assumption behind LDA is that each given text is a mix of multiple topics.

We extract ten topics from the tweets’ text. LDA returns a set of ten words related to each identified topic (but not the title of the topic). We then assign an appropriate title to each set of words that closely reflects the topic at an abstract level. We observe that leaders mostly discuss five topics related to various public concerns during COVID-19. These concerns are: (1) disease symptoms, (2) disease vaccination, (3) disease countermeasures or hygiene, (4) disease transmission during travel and (5) COVID-19 as pandemic/epidemic. In the next section, we analyze the clusters’ alignment toward these public concerns.
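For readers unfamiliar with the method, the following is a compact sketch of LDA fitted by collapsed Gibbs sampling on a toy corpus. The hyperparameters `alpha` and `beta`, the iteration count, the seed and the four example documents are all assumptions for illustration; the paper does not report its settings, and a production analysis would normally rely on an established library.

```python
import random

def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; docs is a list of token lists."""
    rng = random.Random(seed)
    vocab_size = len({word for doc in docs for word in doc})
    doc_topic = [[0] * num_topics for _ in docs]      # tokens per (document, topic)
    topic_word = [dict() for _ in range(num_topics)]  # tokens per (topic, word)
    topic_total = [0] * num_topics                    # tokens per topic
    assignment = []

    def update(d, word, k, delta):
        doc_topic[d][k] += delta
        topic_word[k][word] = topic_word[k].get(word, 0) + delta
        topic_total[k] += delta

    # Start from a random topic for every token.
    for d, doc in enumerate(docs):
        assignment.append([rng.randrange(num_topics) for _ in doc])
        for word, k in zip(doc, assignment[d]):
            update(d, word, k, +1)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                update(d, word, assignment[d][i], -1)  # exclude the current token
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k].get(word, 0) + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                k_new = rng.choices(range(num_topics), weights=weights)[0]
                assignment[d][i] = k_new
                update(d, word, k_new, +1)
    return topic_word

def top_words(topic_word, n=3):
    """Return the n most frequent words of each topic."""
    return [sorted(topic, key=topic.get, reverse=True)[:n] for topic in topic_word]

docs = [
    ["fever", "cough", "symptom", "fever"],
    ["vaccine", "trial", "research", "vaccine"],
    ["cough", "fever", "symptom"],
    ["research", "vaccine", "trial"],
]
topics = lda_gibbs(docs, num_topics=2)
print(top_words(topics))
```

As in the text, the sampler only returns word lists per topic; naming a topic (for example "symptoms" or "vaccination") remains a manual step.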
5. Clusters Alignment Towards Various Concerns
To study the alignment of various clusters towards the five most discussed topics or concerns discovered in Section 4 regarding the rapidly spreading coronavirus disease (COVID-19), we annotate each tweet using specific keywords. Specifically, tweets are annotated as disease symptoms (keyword symptom); vaccination (keywords vaccine and vaccination); disease countermeasures (keywords hygiene, wash, hand and mask); disease transmission during travel (keywords travel, flying, fly, airplane, flight and trip); and pandemic (keywords pandemic and epidemic). In this section, we start by uncovering the specifics of these public concerns to understand them in more detail. Next, we explain the alignment of the leading clusters towards the above-mentioned public concerns.

Analysis of tweets belonging to the symptoms category shows that the daily percentage of observed COVID-19 tweets related to symptoms is higher than for other concerns, which reflects that users on Twitter are more concerned about symptoms of the disease. Furthermore, we extract the symptoms of coronavirus from tweets (see Figure 5a). The identified symptoms are fever, cough and cold. Leaders are also discussing the similarity of COVID-19 and influenza/flu, as both spread among people in a similar way, that is, via respiratory droplets from coughing or sneezing [44]. Leaders are calling COVID-19 an asymptomatic disease, as the time between exposure and symptom onset is typically five days, but may range from two to fourteen days, with mild fever among healthy people [44]. In particular, we also observe that the research cluster is more concerned about understanding the COVID-19 symptoms compared to other clusters (see Figure 6).
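The keyword annotation described at the start of this section can be sketched as follows. The keyword sets are taken from the text; the exact-token matching rule (which presumes lemmatized input, so that "symptoms" has already become "symptom") and the helper name `annotate` are assumptions made for this example.

```python
import re

# Keyword sets as listed in the text, one per public concern.
CONCERN_KEYWORDS = {
    "symptoms": {"symptom"},
    "vaccination": {"vaccine", "vaccination"},
    "countermeasures": {"hygiene", "wash", "hand", "mask"},
    "travel": {"travel", "flying", "fly", "airplane", "flight", "trip"},
    "pandemic": {"pandemic", "epidemic"},
}

def annotate(clean_text):
    """Return the sorted list of concerns whose keywords occur in the tweet."""
    tokens = set(re.findall(r"[a-z]+", clean_text.lower()))
    return sorted(
        concern
        for concern, keywords in CONCERN_KEYWORDS.items()
        if tokens & keywords
    )

print(annotate("wash hand soap avoid travel"))
print(annotate("fever cough symptom"))
```

A tweet can carry several labels at once, which is why the function returns a list rather than a single concern.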
[Figure 5 panels: (a) Symptoms, (b) Vaccine, (c) Countermeasures, (d) Travel]
Figure 5: Wordcloud

Figure 6: Leaders alignment towards various public concerns.

The analysis of tweets belonging to the vaccination category indicates that, as there is currently no vaccine and no specific treatment available for COVID-19, leaders are discussing the effectiveness of flu vaccination for COVID-19, as both flu and COVID-19 cause respiratory disease (see Figure 5b). We can also infer that leaders are intensely discussing the ongoing research to cure COVID-19. Therefore, we can infer that leaders on Twitter are very conscious of the symptoms of COVID-19 and are also keeping an eye on the ongoing vaccination research for COVID-19. As no vaccine is presently available, all clusters are tweeting cautiously regarding vaccination (see Figure 6).
We identify the tweets belonging to the countermeasures category using keywords such as hygiene, wash, hand and mask. Furthermore, we partition countermeasures category tweets into Hygiene (i.e., keywords hygiene, wash, hand) and Mask (i.e., keyword mask). Note that the volume of tweets pertaining to Mask is higher than that of Hygiene before February 24, and after that leaders started focusing more on Hygiene compared to Mask. This indicates that initially leaders were tweeting about wearing masks, but later they shifted their countermeasures strategy towards taking proper Hygiene against COVID-19.

To explore the countermeasures against COVID-19, we create a wordcloud from countermeasures category tweets (see Figure 5c), indicating prevention suggestions including handwashing, respiratory hygiene, self-isolation and self-quarantine. Hand washing using soap and water or sanitizer is highly discussed as a countermeasure. Handwashing is also recommended by the US Centers for Disease Control and Prevention (CDC) to prevent the spread of the disease. It recommends that people wash hands often with soap and water for at least 20 seconds, especially after going to the toilet or when hands are visibly dirty; before eating; and after blowing one’s nose, coughing, or sneezing. It further recommends using an alcohol-based hand sanitizer with at least 60% alcohol by volume when soap and water are not readily available. The advice from CDC to avoid touching the eyes, nose, or mouth with unwashed hands and, for respiratory hygiene, to cover the mouth with masks and take precautions in case of coughing is also highly tweeted.

The tweeting percentage regarding hygiene is higher for health organizations and politicians compared to the news and research clusters (see Figure 6). We can conclude that leaders on Twitter are spreading awareness of the proper hygiene measures against COVID-19 issued by WHO and CDC.
We focus on traveling via flight to understand the effect of COVID-19 on travel. We identify the tweets belonging to the travel category using keywords such as travel, flying, fly, airplane, flight and trip. To explore the effect of COVID-19 on traveling, we create a wordcloud from travel category tweets (see Figure 5d). Among clusters, the news and political clusters focus more on travelling compared to researchers and health organizations (see Figure 6).
The COVID-19 outbreak was first identified in Wuhan, China, in December 2019 [35]. The World Health Organization (WHO) declared the outbreak a Public Health Emergency of International Concern on January 30th [...] researchers, news and health clusters are frequently using pandemic or epidemic while discussing COVID-19 compared to politicians (see Figure 6).

To summarize, on analyzing the clusters’ alignment towards various public concerns (see Figure 6), we find that researchers are highly concerned about understanding COVID-19 by studying its symptoms and the development of vaccination. News organizations are discussing travel and hygiene. Health organizations are focusing on hygiene, whereas politicians are highly concerned about travel and hygiene. This indicates that the different clusters are focused on specific public concerns. This can be viewed as a positive approach, since various clusters are focusing on specific issues as well as engaging with each other on common concerns.
6. Tweet Classification In Clusters
Next, taking insights from previous sections, we build a model to estimate the likelihood that a tweet belongs to a specific cluster. The tweet percentage and count belonging to different clusters are as follows: News (36%; 15,289), Health (33%; 14,023), Research (18%; 7,635) and Politics (13%; 5,521). As the tweet distribution among clusters is imbalanced, this problem is represented as an imbalanced multi-label classification [46].

[Figure 7 flow: Tweet clean text -> Embedding; Emotions; Public concerns -> Concatenate -> Multi-label Classification (SVC, RF, NN or BERT) -> Output]
Figure 7: Flow diagram for features concatenation and model selection.
To illustrate the predictive power of the various feature sets mentioned in the earlier sections, we define a series of models, each with a different feature set. We focus on three different types of features:

1. Tweet text: This refers to the clean text extracted after preprocessing the original tweets (see Section 3.4).
2. Emotions: This refers to the emotions associated with each tweet (see Section 4).
3. Public concerns: This corresponds to the public concerns revealed in Section 5.

6.2. Experimental setup and results
We aim to estimate the likelihood that a tweet belongs to a specific cluster by using the features mentioned in Section 6.1. As in Section 4, we filter the leaders and cluster them into four groups. We remove all tweets from users who are not in any of the leader clusters. We also delete all tweets that are blank after preprocessing. After applying these filters, we obtain a dataset containing 42,468 tweets.

Since the dataset is imbalanced, we use the Synthetic Minority Over-sampling Technique (SMOTE) [47] to resolve this problem. SMOTE works by selecting examples that are close in the feature space, drawing a line between the examples in the feature space and drawing a new sample at a point along that line. Specifically, a random example from the minority class is first chosen. Then k of the nearest neighbors of that example are found (typically k=5). One of these neighbors is chosen at random, and a synthetic example is created at a randomly selected point between the two examples in feature space.

We experiment with several classification models, including Support Vector Classifier (SVC) [48, 49], Random Forests (RF) [50], Random neural network (NN, see Figure 8 for the framework) [51, 52] and Bidirectional Encoder Representations from Transformers (BERT) [53]. Figure 7 displays the data flow to the models. We find that the Random Forest model is the most effective for our task. Because the dataset is imbalanced and classification involves a trade-off between true and false positive rates, we choose to compare models using the area under the receiver operating characteristic (ROC) curve (AUC) [54, 55]. Under this metric, a random baseline scores 50% AUC ROC. We use 5-fold cross-validation for estimation and we run all models three times, giving 15 iterations in total for each model. We also standardize the emotion and public concern features to have zero mean and unit variance.
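The SMOTE procedure described above can be sketched in plain Python as follows. This is an illustrative re-implementation of the interpolation idea, not the implementation used in the experiments; the function name `smote`, its parameters and the four toy minority points are assumptions for the example.

```python
import random

def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)  # random minority example
        # Rank the remaining minority points by squared Euclidean distance.
        others = [point for point in minority if point is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])  # one of the k nearest neighbours
        gap = rng.random()                  # position along the connecting line
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(minority, n_new=6, k=2)
print(len(new_points))  # prints 6
```

Each synthetic point lies on the segment between an existing minority example and one of its nearest minority neighbours, so oversampling stays inside the region the minority class already occupies.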
Results: Table 2 shows the classification accuracy for our models. With the RF model trained on all available features, we achieve a mean accuracy of 96% AUC ROC with a standard deviation of 0.2%. The low standard deviation indicates that the model is robust. All models show that the most important feature for the classification is the clean text. We can note that our best model, that is, RF, is trained with the TF-IDF embedding of the words users used in their tweets. We also see that public concerns and emotions are important characteristics for classifying a tweet into different clusters.

[Figure 8 elements: Input, Conv_1 Convolution (dropout=0.2), Max-Pooling (2 x 2), Fully-Connected LSTM Layer, Fully-Connected Neural Network, Softmax activation, Output (Politics, Health, Research, News)]
Figure 8: Random neural network framework.
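The feature construction described above (a TF-IDF embedding of the clean text concatenated with standardized emotion and public concern features) can be sketched as follows. The weighting shown is the plain textbook form, tf times log(N/df), which may differ from the variant used in the experiments; the toy documents and the emotion and concern values are hypothetical.

```python
import math

def tfidf(docs):
    """Plain TF-IDF: term frequency times log(N / document frequency)."""
    vocab = sorted({word for doc in docs for word in doc})
    n_docs = len(docs)
    doc_freq = {word: sum(1 for doc in docs if word in doc) for word in vocab}
    vectors = [
        [doc.count(word) / len(doc) * math.log(n_docs / doc_freq[word]) for word in vocab]
        for doc in docs
    ]
    return vocab, vectors

def standardize(rows):
    """Scale each column to zero mean and unit variance."""
    columns = list(zip(*rows))
    means = [sum(column) / len(column) for column in columns]
    # A constant column has zero deviation; fall back to 1.0 to avoid dividing by zero.
    stds = [
        math.sqrt(sum((x - m) ** 2 for x in column) / len(column)) or 1.0
        for column, m in zip(columns, means)
    ]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]

docs = [["wash", "hand", "covid"], ["travel", "ban", "covid"], ["vaccine", "trial", "covid"]]
emotions = [[0.0, 2.0], [1.0, 1.0], [2.0, 0.0]]  # hypothetical fear and trust counts
concerns = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]  # hypothetical hygiene and travel flags

vocab, text_vectors = tfidf(docs)
features = [
    text + emotion + concern
    for text, emotion, concern in zip(text_vectors, standardize(emotions), standardize(concerns))
]
print(len(features[0]))  # prints 11
```

A word that appears in every document, such as "covid" here, receives a weight of zero, which is why very frequent terms contribute little to separating the clusters.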
7. Conclusion
With a quest to understand the role of various leaders during COVID-19, we study a large number of tweets using techniques such as network analysis, text analysis and sentiment analysis. On the basis of network analysis, users are categorized into four different clusters, namely research, news, health and politics. The results of the text analysis show that leaders of different clusters focus on various public concerns. In particular, researchers focus on understanding the pandemic symptoms and developing vaccination; news on travel and hygiene; health organizations on hygiene; and politicians on travel and hygiene. Sentiment analysis indicates that emotions such as anticipation, fear, sadness and trust are dominant for all clusters. Finally, we show that the features presented in this work can be used to classify tweets into the various clusters with an accuracy of up to 96% AUC ROC. Our findings suggest that tweets themselves carry some features which can be used to identify a user’s profession.

Note that the Twitter stream is filtered following Twitter’s API documentation; hence the tweets analyzed here constitute a representative subset of the stream as opposed to the entire stream. In the future, we intend to analyze the tweet data to a greater extent for different categories of users to understand their patterns. Future work should also explore more sophisticated models and analyze in greater depth how the writing style of the various clusters can be exploited to capture their signature in their tweets.

Table 2: Model mean AUC ROC and standard deviation for various dataset features. For example, the BERT model with the dataset (Text+Emotion) achieves 93.6% mean AUC ROC accuracy with 0.8% standard deviation.
Acknowledgment
This research was funded by ERDF via the IT Academy Research Programme and SoBigData++.