Characterizing Twitter Interaction during COVID-19 pandemic using Complex Networks and Text Mining

Abstract

The COVID-19 outbreak began many months ago, with its reported origin in a market in Wuhan, China. The virus spread quickly to other countries because international travel is affordable, many countries are only a few flight hours apart, and borders carry a constant flow of people. On the other hand, Internet users habitually share content on Social Networks, and issues, problems and thoughts about COVID-19 are no exception. It is therefore possible to analyze Social Network interaction in a city or country to understand the impact of this global issue. South America is a region of developing countries facing challenges in Politics, Economy, Public Health and other areas. The scope of this paper is to analyze Twitter interaction in South American countries and to characterize the flow of data among users using a Complex Network representation and Text Mining. The preliminary experiments suggest the existence of patterns similar to those of Complex Systems. In addition, the degree distribution supports the idea of an underlying System, and visualization of the Adjacency Matrices shows groups of users publishing and interacting together over time, which opens the possibility of identifying robots that post constantly.


Josimar Edinson Chire Saire

Institute of Mathematics and Computer Science (ICMC), University of São Paulo (USP)

São Carlos, SP


Index Terms: Twitter Analytics, Text Mining, Complex Network, Covid-19, Data Science, South America, Pandemic

I. INTRODUCTION

Nowadays, the use of Social Networks to communicate and to share information, thoughts and ideas is very common. People create posts and write throughout the day, tagging friends, colleagues, etc. This flow of data can therefore represent the current state of citizens. In the context of the covid-19 pandemic, users reflect what they think and feel about this global issue in their cities and countries. Consequently, this behaviour can be analyzed to monitor the situation of the population. In the health area, Infodemiology studies behaviour through data, and Infoveillance is its application using Computational tools and directed/undirected sources of data.

Currently, many studies related to covid-19 analyze people's behaviour using Social Networks, in particular Twitter, because this Social Network makes data easy to access and the quantity of data can be representative. Examples include the top concerns of users [1] and studies focused on specific countries: Italy [2], Peru [3], United Kingdom, United States [4], Mexico [ChireSaire2020.05.07.20094466], Colombia [5], United States [6], Ghana [7], France [8]. These studies use a Data Mining approach and Natural Language Processing techniques to describe and understand the phenomenon at many levels: public health, social, mental health and more. However, this flow of data can include misinformation produced intentionally through robots. One approach to represent and analyze these networks of users exchanging data is to use Complex Networks. Complex Networks have many applications in different areas: Physics, Biology and Social Sciences. Complex Networks are therefore a capable representation for studying the interaction of users; e.g., [9] studies the top users in Spain using Twitter as the data source.

The contribution of this paper:

• Select South America as scope and study the influence of the covid-19 pandemic on this region, open the possibility of studying this phenomenon and provide a proposal for this analysis (Section II).
• Introduce Complex Networks to study this phenomenon with data from Social Networks and find patterns related to Complex Systems.
• Find an affordable way to identify networks of users with a constant flow of data, along with the possibility of finding robots and fake users through this mechanism (Section III).

II. PROPOSAL

This paper analyzes the interaction, through the Twitter Social Network, of users in South American countries where Spanish is an official language. Considering Internet access and population density, the capital of each country was selected for the analysis.

A. Dataset

The dataset is a collection of tweets from 08 March to 11 July obtained using the Twitter API; Table I describes the dataset.

B. Complex Network Construction

The Complex Network was created according to the following criteria:

• Pick the N users with the most posts during the period of study.
• Search for @tag mentions of users inside the tweets to find user connections.
• Build a global list of users and create a set from its elements to avoid duplicated users.
• Create a global text for each country to find the users and count them.
• Create the edges considering the N top users and the set of users, with the frequency as weight.

TABLE I: Description of Dataset

Country    Capital       Latitude  Longitude  Tweets   Unique Users
Argentina  Buenos Aires  -34.583   -58.667    3872212  749842
Bolivia    La Paz        -16.5     -68.15     134605   21850
Chile      Santiago      -33.45    -70.667    3146307  309351
Colombia   Bogota        4.6       -74.083    1688045  315794
Ecuador    Quito         -0.217    -78.5      836017   102979
Paraguay   Asuncion      -25.267   -57.667    1628146  156384
Peru       Lima          -12.05    -77.05     2054415  287261
Uruguay    Montevideo    -34.85    -56.167    826405   192381
Venezuela  Caracas       10.483    -66.867    4102893  305684

III. EXPERIMENTS AND RESULTS

This section explains the experiments performed to describe, analyze and understand the interaction of South American Twitter users.

A. First Experiment

The first experiment used N = 500, N = 100 and created a directed graph with no associated weights. Fig. 1 presents the results for Peru, where a kind of reticular pattern can be noticed. This suggests the possible existence of user groups.

Fig. 1: Peru

For the next experiment, the frequency of @tag mentions is used to set the weight of the edges.
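The construction criteria of Section II-B, together with the mention-frequency weights used from the next experiment on, can be sketched as follows. This is a minimal illustration rather than the author's code: the function name `build_mention_edges`, the `(author, text)` tweet layout, the regular expression for handles and the lowercasing of handles are all assumptions.

```python
import re
from collections import Counter

# Assumed handle pattern: "@" followed by letters, digits or underscores.
MENTION = re.compile(r"@(\w+)")


def build_mention_edges(tweets, n_top):
    """Return {(author, mentioned_user): mention_count} for the n_top most active authors.

    `tweets` is an iterable of (author, text) pairs; this layout is hypothetical.
    """
    tweets = [(author.lower(), text) for author, text in tweets]

    # Pick the N users with the most posts during the period.
    posts_per_user = Counter(author for author, _ in tweets)
    top_users = {user for user, _ in posts_per_user.most_common(n_top)}

    # Count every @mention written by a top user; the count is the edge weight.
    edges = Counter()
    for author, text in tweets:
        if author not in top_users:
            continue
        for mentioned in MENTION.findall(text):
            edges[(author, mentioned.lower())] += 1
    return dict(edges)


# Toy data, invented for illustration only.
sample = [
    ("ana", "hola @Luis y @maria"),
    ("ana", "@luis covid"),
    ("ana", "sin menciones"),
    ("luis", "@ana ok"),
]
edges = build_mention_edges(sample, n_top=1)
```

With `n_top=1` only `ana` is kept as a source node, and her two mentions of `luis` collapse into a single directed edge of weight 2. Dropping the counts and keeping only the keys gives the unweighted graph of the first experiment.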

B. Second Experiment

The second experiment used N = 2000, N = 200 and created a directed graph whose edges are weighted by the frequency of @tag mentions. Fig. 2 shows the result for Argentina. The number of edges in this Complex Network is higher than 20,000 and the weights are widely dispersed, so a log scale is introduced. Despite this adaptation, it is not possible to perceive the edges or connections between users.

Fig. 2: Peru

For this reason, a filtering step is performed considering the degree distribution (see Fig. 3), and only edges with weight higher than 200 are kept. It is also important to notice that the distribution resembles a Lévy distribution.

Fig. 3: South American countries - Degree Distribution. Panels: (a) Argentina, Bolivia, Chile; (b) Colombia, Ecuador, Peru; (c) Paraguay, Uruguay, Venezuela.

TABLE II: Number of communities per country

Country    Number of communities
Argentina  34
Bolivia    39
Chile      19
Colombia   47
Ecuador    30
Peru       36
Paraguay   28
Uruguay    37
Venezuela  26

The results for Argentina, Bolivia, Chile, Colombia, Ecuador, Paraguay, Peru, Uruguay and Venezuela are presented in Fig. 4, where a color bar expresses the strength or weight of each edge. Analyzing these results, it is possible to identify networks of users in each country. Bolivia has fewer users, so its filtered matrix is smaller than those of the other countries. It is also possible to identify line structures for Colombia, Ecuador, Peru, Paraguay, Uruguay and Venezuela. These lines may represent a group of users constantly tagging one another. Venezuela draws attention: considering the size of the filtered matrix and the extent of the lines, it is possible to identify two big groups of 100 and 150 users tagging one another during the period of study.
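The weight filter, log scaling and degree distribution used in this experiment can be sketched as follows. The threshold of 200 comes from the text; the function names, the base-10 logarithm and the choice of counting degree as in-degree plus out-degree are assumptions made for illustration.

```python
import math
from collections import Counter


def filter_edges(edges, min_weight=200):
    """Keep only edges whose weight is strictly higher than min_weight."""
    return {pair: w for pair, w in edges.items() if w > min_weight}


def adjacency_matrix(edges, log_scale=True):
    """Return (nodes, matrix) with matrix[i][j] holding the weight of edge nodes[i] -> nodes[j]."""
    nodes = sorted({user for pair in edges for user in pair})
    index = {user: i for i, user in enumerate(nodes)}
    matrix = [[0.0] * len(nodes) for _ in nodes]
    for (src, dst), w in edges.items():
        # Assumed base-10 log; a weight of 1 would map to 0, so filter first.
        matrix[index[src]][index[dst]] = math.log10(w) if log_scale else float(w)
    return nodes, matrix


def degree_distribution(edges):
    """Return {degree: number_of_nodes}, with degree = in-degree + out-degree."""
    degree = Counter()
    for src, dst in edges:
        degree[src] += 1
        degree[dst] += 1
    return dict(Counter(degree.values()))


# Toy weights, invented for illustration only.
toy = {("a", "b"): 1000, ("b", "a"): 300, ("a", "c"): 150, ("c", "d"): 201}
kept = filter_edges(toy)
nodes, matrix = adjacency_matrix(kept)
distribution = degree_distribution(kept)
```

On the toy data the edge `a -> c` (weight 150) is removed by the filter, and the remaining matrix can be passed to any heat-map routine to obtain a plot in the spirit of Fig. 4.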

C. Finding communities
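The community search in this subsection does not name its algorithm. As a minimal sketch, the code below assumes a deterministic label propagation variant on the undirected, weighted version of the mention graph, chosen only because it needs no external library. The function name, the tie-breaking rule and the toy graph are hypothetical, and the algorithm actually used in the paper may differ.

```python
from collections import defaultdict


def label_propagation(edges, max_iter=100):
    """Return {node: community_label} for a weighted edge dict {(u, v): weight}."""
    # Treat the directed mention graph as undirected by summing both directions.
    neighbours = defaultdict(lambda: defaultdict(float))
    for (u, v), w in edges.items():
        if u == v:
            continue  # self-mentions carry no community information
        neighbours[u][v] += w
        neighbours[v][u] += w

    labels = {node: node for node in neighbours}  # every node starts alone
    for _ in range(max_iter):
        changed = False
        for node in sorted(neighbours):
            # Total edge weight towards each label among this node's neighbours.
            score = defaultdict(float)
            for other, w in neighbours[node].items():
                score[labels[other]] += w
            best = max(score.values())
            candidates = sorted(lab for lab, s in score.items() if s == best)
            # Keep the current label on ties, otherwise take the smallest best label.
            new_label = labels[node] if labels[node] in candidates else candidates[0]
            if new_label != labels[node]:
                labels[node] = new_label
                changed = True
        if not changed:
            break
    return labels


# Toy graph, invented for illustration: two tight triangles joined by one weak edge.
toy = {
    ("a", "b"): 5, ("b", "c"): 5, ("a", "c"): 5,
    ("x", "y"): 5, ("y", "z"): 5, ("x", "z"): 5,
    ("c", "x"): 1,
}
labels = label_propagation(toy)
n_communities = len(set(labels.values()))
```

Counting the distinct labels per country would give a figure comparable in kind to the community counts reported for this subsection, although the exact numbers depend on the algorithm chosen.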

To analyze the existence of user groups in more depth, a community detection algorithm is run over the complete graph. The results are presented in Table II and Fig. 5. The earlier hypothesis about Venezuela can be confirmed considering the number of communities present in the interaction of users. In addition, the images representing the communities show a large concentration of nodes in the center, reflecting the tagging of some specific users.

IV. CONCLUSION

This preliminary study of Twitter interaction among South American users around the covid-19 pandemic shows promising results. First, a text mining approach is used to process text and find users. Second, the experiments section shows the viability of creating a Complex Network with the proposal. Finally, visualization techniques are proposed to analyze the adjacency matrix of each country, together with a filtering process to select the most representative behaviour and the discovery of communities. Venezuela raises a concern about an intentional group of users publishing content during the period of study.

V. FUTURE WORK

Future work involves performing a weekly analysis with the aim of finding changes in the interaction and in the number of top users, and describing this behaviour using Complex Networks and features related to degree measurements of the nodes and edges.

ACKNOWLEDGMENTS

The author wants to mention Research4Tech, an Artificial Intelligence community of Latin American Researchers for promoting Science and collaboration in Latin American countries; its role as integrator between Professional, Researcher and Technology communities is key to developing the Latin American region as a strong body.

REFERENCES

[1] A. Abd-Alrazaq, D. Alhuwail, M. Househ, M. Hamdi, and Z. Shah, "Top concerns of tweeters during the covid-19 pandemic: infoveillance study," Journal of Medical Internet Research, vol. 22, no. 4, p. e19016, 2020.
[2] E. De Santis, A. Martino, and A. Rizzi, "An infoveillance system for detecting and tracking relevant topics from italian tweets during the covid-19 event," IEEE Access, vol. 8, pp. 132527–132538, 2020.
[3] J. E. Chire Saire and J. Oblitas, "Covid19 surveillance in peru on april using text mining," medRxiv

Fig. 4: South American countries - Adjacency Matrix. Panels: (a) Argentina, Bolivia, Chile; (b) Colombia, Ecuador, Peru; (c) Paraguay, Uruguay, Venezuela.