Characterizing Twitter Interaction during COVID-19 pandemic using Complex Networks and Text Mining
Josimar Edinson Chire Saire
Institute of Mathematics and Computer Science (ICMC), University of São Paulo (USP)
São Carlos, SP, [email protected]
Abstract —The covid-19 outbreak started many months ago, and its reported origin was the Wuhan Market, China. The virus spread quickly to other countries because international travel is affordable, many countries are only a few flight hours apart, and borders see a constant flow of people. On the other hand, Internet users habitually share content on Social Networks, and issues, problems and thoughts about covid-19 are no exception. It is therefore possible to analyze Social Network interaction in a city or country to understand the impact of this global issue. South America is a region of developing countries facing challenges in Politics, Economy, Public Health and other areas. The scope of this paper is to analyze the Twitter interaction of South American countries and to characterize the flow of data among users using a Complex Network representation and Text Mining. The preliminary experiments suggest the existence of patterns similar to those of Complex Systems. In addition, the degree distribution supports the idea of a System, and visualization of the Adjacency Matrices shows groups of users publishing and interacting together over time; it may also be possible to identify robots that post constantly.
Index Terms —Twitter Analytics, Text Mining, Complex Network, Covid-19, Data Science, South America, Pandemic
I. INTRODUCTION
Nowadays, the use of Social Networks to communicate and share information, thoughts and ideas is very common. People create posts, write throughout the day and tag friends, colleagues and others. This flow of data can therefore represent the current status of citizens. During the covid-19 pandemic in particular, users reflect what they think and feel about this global issue in their cities and countries. Consequently, this behaviour can be analyzed to monitor the situation of the population. In the health area, Infodemiology studies this behaviour through data, and Infoveillance is its application using computational tools and directed/undirected sources of data.

Currently, many covid-19 studies analyze people's behaviour using Social Networks, in particular Twitter, because this Social Network makes data easy to access and the quantity of data can be representative. Examples include the top concerns of users [1] and studies focused on specific countries: Italy [2], Peru [3], United Kingdom, United States [4], Mexico [ChireSaire2020.05.07.20094466], Colombia [5], United States [6], Ghana [7], France [8]. These studies use a Data Mining approach and Natural Language Processing techniques to describe and understand the phenomenon at many levels: public health, social, mental health and more. However, this flow of data can include misinformation produced intentionally through robots. One approach to represent and analyze these networks of users exchanging data is to use Complex Networks. Complex Networks have many applications in different areas: Physics, Biology and Social Sciences. Complex Networks are thus a capable representation for studying the interaction of users; e.g., [9] studies the top users in Spain using Twitter as the data source.

The contributions of this paper are:
• Select South America as the scope and study the influence of the covid-19 pandemic on this region, opening the possibility of studying this phenomenon and providing a proposal for this analysis (Section II).
• Introduce Complex Networks to study this phenomenon using data from Social Networks and find patterns related to Complex Systems.
• Find an affordable way to identify networks of users with a constant flow of data, along with the possibility of finding robots and fake users through this mechanism (Section III).

II. PROPOSAL
This paper analyzes the interaction, through the Twitter Social Network, of users in South American countries where Spanish is an official language. Considering Internet access and population density, the capital of each country was selected for the analysis.
A. Dataset
The dataset is a collection of tweets from 08 March to 11 July gathered with the Twitter API; Table I describes the dataset.
B. Complex Network Construction
The Complex Network was created according to the following criteria:
• Pick the N users with the most posts during the period of study.
• Search for @tag mentions of users inside the tweets to find connections between users.
• Build a global list of users and create a set from its elements to avoid duplicated users.
• Create a global text from the tweets of each country to find the users and count them.
• Create the edges between the N top users and the set of users, with the frequency as weight.

TABLE I: Description of Dataset
Country    Capital       Latitude  Longitude  Tweets   Unique Users
Argentina  Buenos Aires  -34.583   -58.667    3872212  749842
Bolivia    La Paz        -16.5     -68.15     134605   21850
Chile      Santiago      -33.45    -70.667    3146307  309351
Colombia   Bogota        4.6       -74.083    1688045  315794
Ecuador    Quito         -0.217    -78.5      836017   102979
Paraguay   Asuncion      -25.267   -57.667    1628146  156384
Peru       Lima          -12.05    -77.05     2054415  287261
Uruguay    Montevideo    -34.85    -56.167    826405   192381
Venezuela  Caracas       10.483    -66.867    4102893  305684

III. EXPERIMENTS AND RESULTS
This section explains the experiments performed to describe, analyze and understand the interaction of South American Twitter users.
A. First Experiment
The first experiment used N = 500 and N = 100 and created a directed graph with no associated weights. Fig. 1 presents the results for Peru, where a kind of reticular pattern is noticeable. This suggests the possible existence of groups of users.

Fig. 1: Peru

For the next experiment, the frequency of @tag mentions is used to set the weight of the edges.
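The construction criteria of Section II-B can be sketched as follows. This is a minimal illustration, not the author's implementation: the input format (a list of `(author, text)` pairs), the function name `build_mention_edges` and the handle pattern are assumptions introduced here.

```python
import re
from collections import Counter

# Assumed handle pattern: "@" followed by 1-15 word characters.
MENTION = re.compile(r"@(\w{1,15})")

def build_mention_edges(tweets, n_top):
    """Return {(author, mentioned_user): frequency} for the n_top most active authors.

    `tweets` is an iterable of (author, text) pairs (hypothetical input format).
    """
    tweets = list(tweets)  # allow two passes even if a generator is given
    # Criterion 1: pick the N users with the most posts.
    posts = Counter(author.lower() for author, _ in tweets)
    top_users = {user for user, _ in posts.most_common(n_top)}

    # Criteria 2-5: find @tag mentions and count them per (author, mentioned) pair.
    edges = Counter()
    for author, text in tweets:
        author = author.lower()
        if author not in top_users:
            continue
        for mentioned in MENTION.findall(text):
            edges[(author, mentioned.lower())] += 1
    return edges

# Toy data, invented for illustration only.
tweets = [
    ("ana", "hola @luis y @maria"),
    ("ana", "@luis gracias"),
    ("luis", "@ana de nada"),
    ("maria", "sin menciones"),
    ("ana", "otro tweet"),
]
edges = build_mention_edges(tweets, n_top=1)
unweighted = set(edges)  # directed edges without weights, as in the first experiment
```

Taking `set(edges)` drops the frequencies and leaves the unweighted directed graph of the first experiment; keeping the counts gives the weighted variant used next.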
B. Second Experiment
The second experiment used N = 2000 and N = 200 and created a directed graph, now with the @tag frequency as the edge weight. Fig. 2 introduces the result for Argentina. The number of edges in this Complex Network is higher than 20,000 and the weights are very dispersed, so a log scale is introduced. Despite this adaptation, it is not possible to perceive the edges or connections between users.

Fig. 2: Peru

For this reason, a filtering step is performed considering the degree distribution (see Fig. 3), and only edges with weight higher than 200 are kept. It is also important to notice the presence of a distribution similar to a Lévy distribution.

Fig. 3: South American countries - Degree Distribution. (a) Argentina, Bolivia, Chile (b) Colombia, Ecuador, Peru (c) Paraguay, Uruguay, Venezuela

TABLE II: Number of communities per country
Country    Number of communities
Argentina  34
Bolivia    39
Chile      19
Colombia   47
Ecuador    30
Peru       36
Paraguay   28
Uruguay    37
Venezuela  26

Fig. 4 shows the results for Argentina, Bolivia, Chile, Colombia, Ecuador, Paraguay, Peru, Uruguay and Venezuela; a color bar expresses the strength or weight of each edge.

Analyzing these results, it is possible to identify networks of users in each country. Bolivia has fewer users, so its filtered matrix is smaller than those of the other countries. It is also possible to identify line structures for Colombia, Ecuador, Peru, Paraguay, Uruguay and Venezuela. These lines may represent groups of users who constantly tag one another. Venezuela stands out: considering the size of the filtered matrix and the length of the lines, it is possible to identify two big groups of 100 and 150 users tagging one another during the period of study.
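The weight filter and the degree distribution used in the second experiment can be sketched as below. This is only an illustration under stated assumptions: the edge dictionary format, the function names, and the definition of degree as in-degree plus out-degree are choices made here, not details given in the paper.

```python
from collections import Counter

def filter_edges(edges, min_weight=200):
    """Keep only edges whose weight is strictly higher than min_weight."""
    return {pair: w for pair, w in edges.items() if w > min_weight}

def degree_distribution(edges):
    """Map each degree value to the number of nodes with that degree.

    Degree is counted as in-degree plus out-degree (an assumption).
    """
    degree = Counter()
    for source, target in edges:
        degree[source] += 1
        degree[target] += 1
    return Counter(degree.values())

def adjacency_matrix(edges):
    """Return (sorted node list, dense weighted adjacency matrix as nested lists)."""
    nodes = sorted({user for pair in edges for user in pair})
    index = {user: i for i, user in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for (source, target), weight in edges.items():
        matrix[index[source]][index[target]] = weight
    return nodes, matrix

# Toy weighted edges, invented for illustration only.
edges = {("a", "b"): 350, ("b", "a"): 10, ("a", "c"): 201, ("c", "d"): 200}
kept = filter_edges(edges)  # threshold of 200, as in the text
distribution = degree_distribution(kept)
nodes, matrix = adjacency_matrix(kept)
```

Rows of `matrix` are the tagging users and columns the tagged users, so runs of non-zero cells along a row or column would correspond to the line structures discussed above.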
C. Finding communities
To analyze the existence of groups of users in more depth, a community-detection algorithm is run over the complete graph. The results are presented in Table II and Fig. 5.

The earlier hypothesis about Venezuela can be confirmed considering the number of communities present in the interaction of users. In addition, the images representing the communities show a large concentration of nodes in the center, reflecting the tagging of some specific users.

IV. CONCLUSION
This preliminary study of the Twitter interaction of South American users around the covid-19 pandemic shows promising results. First, a text mining approach is used to process text and find users. Second, the experiments section shows the viability of creating a Complex Network with the proposal. Finally, visualisation techniques are proposed to analyze the adjacency matrix of each country, together with a filtering process to select the most representative behaviour and the discovery of communities. Venezuela raises a concern about an intentional group of users publishing content during the period of study.

V. FUTURE WORK
Future work involves performing a week-by-week analysis with the aim of finding changes in the interaction and in the number of top users, and describing this behaviour using Complex Networks and features related to degree measurements of the nodes and edges.

ACKNOWLEDGMENTS
The author wants to mention Research4Tech, an Artificial Intelligence community of Latin American researchers that promotes Science and collaboration in Latin American countries; its role as an integrator between Professional, Researcher and Technology communities is key to developing the Latin American region as a strong body.

REFERENCES
[1] A. Abd-Alrazaq, D. Alhuwail, M. Househ, M. Hamdi, and Z. Shah, “Top concerns of tweeters during the covid-19 pandemic: infoveillance study,” Journal of medical Internet research, vol. 22, no. 4, p. e19016, 2020.
[2] E. De Santis, A. Martino, and A. Rizzi, “An infoveillance system for detecting and tracking relevant topics from italian tweets during the covid-19 event,” IEEE Access, vol. 8, pp. 132 527–132 538, 2020.
[3] J. E. Chire Saire and J. Oblitas, “Covid19 surveillance in peru on april using text mining,” medRxiv
Fig. 4: South American countries - Adjacency Matrix. (a) Argentina, Bolivia, Chile (b) Colombia, Ecuador, Peru (c) Paraguay, Uruguay, Venezuela