Characterizing information leaders in Twitter during COVID-19 crisis
David Pastor-Escuredo and
Carlota Tarazona
Center of Innovation and Technology for Development, Technical University Madrid, Spain
LifeD Lab, Madrid, Spain
Abstract. Information is key during a crisis such as the current COVID-19 pandemic, as it greatly shapes people’s opinion, behaviour and even their psychological state. The Secretary-General of the United Nations has acknowledged that the infodemic of misinformation is an important secondary crisis produced by the pandemic. Infodemics can amplify the real negative consequences of the pandemic in different dimensions: social, economic and even sanitary. For instance, infodemics can lead to hatred between population groups that fragments society and influences its response, or result in negative habits that help the pandemic propagate. On the contrary, reliable and trustworthy information, along with messages of hope and solidarity, can be used to control the pandemic, build safety nets and help promote resilience and antifragility. We propose a framework to characterize leaders in Twitter based on the analysis of the social graph derived from the activity in this social network. Centrality metrics are used to identify relevant nodes that are further characterized in terms of users’ parameters managed by Twitter. We then assess the resulting topology of clusters of leaders. Although this tool may be used for surveillance of individuals, we propose it as the basis for a constructive application to empower users with a positive influence on the collective behaviour of the network and the propagation of information.
INTRODUCTION
Misinformation and fake news are a recurrent problem of our digital era [1-3]. The volume of misinformation and its impact grow during large events, crises and hazards [4]. When misinformation turns into a systemic pattern it becomes an infodemic [5,6]. Infodemics are especially frequent in social networks, which are distributed systems of information generation and spreading. Content is not the only variable that makes this happen: the structure of the social network and the behavior of relevant people also contribute greatly [6]. During a crisis such as the current COVID-19 pandemic, information is key as it greatly shapes people’s opinion, behaviour and even their psychological state [7-9]. However, the greater the impact, the greater the risk [10]. The Secretary-General of the United Nations has acknowledged that the infodemic of misinformation is an important secondary crisis produced by the pandemic. During a crisis, time is critical, so people need to be informed at the right time [11,12]. Furthermore, information during a crisis leads to action, so the population needs to be properly informed to act right [13]. Thus, infodemics can amplify the real negative consequences of the pandemic in different dimensions: social, economic and even sanitary. For instance, infodemics can lead to hatred between population groups [14] that fragments society and influences its response, or result in negative habits that help the pandemic propagate. On the contrary, reliable and trustworthy information, along with messages of hope and solidarity, can be used to control the pandemic, build safety nets and help promote resilience and antifragility. To fight misinformation and hate speech, content-based filtering is the most common approach taken [6,15-17]. The availability of Deep Learning tools makes this task easier and scalable [18-20].
Also, positioning in search engines is key to ensure that misinformation does not dominate the most relevant results of the searches. However, in social media, besides content, people’s individual behavior and network properties, dynamics and topology are other relevant factors that determine the spread of information through the network [21-23]. We propose a framework to characterize leaders in Twitter based on the analysis of the social graph derived from the activity in this social network [24]. Centrality metrics are used to identify relevant nodes that are further characterized in terms of users’ parameters managed by Twitter [25-29]. Although this tool may be used for surveillance of individuals, we propose it as the basis for a constructive application to empower users with a positive influence on the collective behaviour of the network and the propagation of information [27,30].
MATERIALS AND METHODS
2.1 Data
Tweets were retrieved using the real-time streaming API of Twitter. Two concurrent filters were used for the streaming: location and keywords. Location was restricted to a bounding box enclosing the city of Madrid [-3.7475842804, 40.3721683069, -3.6409114868, 40.4886258195], whereas the keywords were basic terms naming the pandemic: ['coronavirus', 'Coronavirus', …]. […] nd and May 4th.
*These authors contributed equally. Email: [email protected]
2.2 Network construction
Each tweet was analyzed to extract mentioned users, retweeted users, quoted users or replied users. For each of these events the corresponding nodes were added to an undirected graph as well as a corresponding edge initializing the edge property “flow”. If the edge was already created, the property “flow” was incremented. This procedure was repeated for each tweet registered. The network was completed by adding the property “inverse flow”, that is 1/flow, to each edge. The resulting network featured 107544 nodes and 116855 edges.
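The construction procedure above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `build_flow_graph`, the example events and the initial value of 1 for "flow" are assumptions (the text only states that the property is initialized and then incremented).

```python
from collections import defaultdict


def build_flow_graph(interactions):
    """Aggregate (user, other_user) interaction events into undirected edges."""
    flow = defaultdict(int)
    for user, other in interactions:
        # Undirected graph: (a, b) and (b, a) map to the same edge key.
        edge = tuple(sorted((user, other)))
        # Assumption: "flow" is 1 after the first event and grows by 1 per repeat.
        flow[edge] += 1
    # "inverse flow" is defined in the text as 1 / flow.
    return {
        edge: {"flow": count, "inverse_flow": 1.0 / count}
        for edge, count in flow.items()
    }


# Hypothetical events: (tweet author, mentioned/retweeted/quoted/replied user).
events = [("ana", "luis"), ("luis", "ana"), ("ana", "marta")]
graph = build_flow_graph(events)
```

In this sketch the two events between "ana" and "luis" collapse into a single edge with flow 2 and inverse flow 0.5, which is the kind of weight a current flow model would later treat as a resistance-like quantity.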
2.3 Relevance metrics
Table 1.
List of descriptors
To compute centrality metrics, the network described above was filtered. First, users with a node degree (number of edges connected to the node) below a given threshold (experimentally set to 3) were removed from the network, together with the edges connected to those nodes. This filtering reduced the computational cost, as algorithms for centrality metrics are expensive, and also removed poorly connected nodes, as the network is built from sparse data (retweets, mentions and quotes). However, it is desirable to minimize the amount of filtering in order to study large-scale properties of the network. The resulting network featured 15845 nodes and 26837 edges. Additionally, the network was filtered to be connected, which is a requirement for computing several of the centrality metrics described below. For this purpose, the connected subnetworks were identified and the largest one was selected as the target network for analysis. The resulting network featured 12006 nodes and 25316 edges. Several centrality metrics were computed: cfbetweenness, betweenness, closeness, cfcloseness, eigenvalue, degree and load. Each of these centrality metrics highlights a specific relevance property of a node with regard to the whole flow through the network. The descriptors are summarized in Table 1. Besides the network-based metrics, Twitter user parameters were collected (followers, following and favorites) so that their relationships with the relevance metrics could be assessed.
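The two filtering steps described above (degree threshold, then largest connected subnetwork) can be illustrated with a short sketch. The function names and the toy edge list are hypothetical, and the degree filter is assumed to be applied in a single pass, since the text does not mention repeating it after nodes are removed.

```python
from collections import defaultdict, deque


def filter_by_degree(edges, min_degree=3):
    """Drop nodes whose degree is below min_degree, and their edges (one pass)."""
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    keep = {node for node, d in degree.items() if d >= min_degree}
    return [(u, v) for u, v in edges if u in keep and v in keep]


def largest_connected_component(edges):
    """Return the node set of the largest connected component (breadth-first search)."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    seen, best = set(), set()
    for start in adjacency:
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in component:
                    component.add(neighbor)
                    queue.append(neighbor)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


# Toy network: a fully connected group a-b-c-d, a pendant node e, and an isolated pair x-y.
toy_edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("b", "d"),
             ("c", "d"), ("a", "e"), ("x", "y")]
filtered_edges = filter_by_degree(toy_edges, min_degree=3)
target_nodes = largest_connected_component(filtered_edges)
```

Here the threshold of 3 removes e, x and y, and the remaining four nodes form the single connected component that would be passed on to the centrality computations.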
2.4 Statistics
We applied several statistical tools to characterize users in terms of the relevance metrics. We also implemented visualizations of different variables and of the network for a better understanding of the characterization and topology of leading nodes.
RESULTS
3.1 Descriptors correlation
We compared the relevance in the network derived from the centrality metrics with the user profile variables of Twitter: number of followers, number of following and retweet count. Figure 1 shows a scatter-plot matrix of all variables. The principal diagonal of the figure shows the distribution of each variable; these distributions are typically characterized by a high concentration at low values and a very long tail. They imply that a few nodes concentrate most of the relevance within the network. More surprisingly, the same distributions are observed for Twitter user parameters such as number of followers or friends (following).
Figure 1.
Matrix of histograms and scatter plots among all variables. From left to right and from top to bottom: cfbetweenness, betweenness, closeness, cfcloseness, eigenvalue, degree, load, followers, following, favorites and status_count.
Table 1 (List of descriptors):
Degree: Radial and volume-based centrality computed from 1-length walks (normalized degree) based on the flow property. This centrality was computed for both directions of the directed graph.
Eigenvalue: Radial and volume-based centrality computed from infinite-length walks. This centrality was computed for both directions of the directed graph.
Closeness: Radial and length-based centrality that considers the length of the shortest paths from all nodes to the target node based on the flow property. This centrality was computed for both directions of the directed graph.
Betweenness: Medial and volume-based centrality that considers the number of shortest paths passing through a target node based on the flow property. This centrality was computed for both directions of the directed graph.
Current flow closeness: Radial and length-based centrality based on the current flow model using the inverse flow property. This centrality was computed for the largest connected undirected subgraph.
Current flow betweenness: Medial and volume-based centrality based on the current flow model using the inverse flow property. This centrality was computed for the largest connected undirected subgraph.
Load: The load centrality of a node is the fraction of all shortest paths that pass through that node. Load centrality is slightly different from betweenness.
The scatter plots show that there is no significant correlation between variables, except for the pair of betweenness and load centralities, as expected because they have similar definitions. This fact is remarkable: different centrality metrics provide different perspectives on the leading nodes within the network, and relevance does not necessarily correlate with the number of related users but may also depend on the content dynamics.
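The visual comparison of descriptor pairs could be complemented with a numerical coefficient. The sketch below uses Spearman rank correlation, which is an assumption on our side: the text does not state which correlation measure, if any, was computed, and a rank-based coefficient is chosen here only because the distributions are described as long-tailed. All function names and the toy values are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences with non-zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    spread_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return covariance / (spread_x * spread_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy descriptor values for four hypothetical users.
betweenness = [0.01, 0.02, 0.30, 0.05]
load = [0.02, 0.03, 0.40, 0.06]
followers = [900, 15, 40, 120]
rho_similar = spearman(betweenness, load)
rho_unrelated = spearman(betweenness, followers)
```

In this toy data, betweenness and load are ordered identically, so their coefficient is 1, whereas betweenness and followers give a negative value; real descriptor pairs would of course need to be evaluated on the full set of nodes.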
3.2 Ranking
Users were ranked using one variable as the reference. Figure 2 shows the ranking resulting from using the eigenvalue centrality as the reference. The values were saturated at the 95th percentile of the distribution to improve visualization and avoid the effect of a few extreme values. This visualization confirms the lack of correlation between variables and the highly asymmetric distribution of the descriptors. Figure 3 summarizes the values of each leader for each descriptor, showing that even within the top-ranked leaders there is very large variability. This means that some nodes are singular events within the network that require further analysis to be interpreted, as they could be leaders in society or just a product of the network dynamics.
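The ranking and saturation steps can be sketched as below. The helper names, the toy descriptor values and the linear-interpolation definition of the percentile are assumptions for illustration; the text only states that values were saturated at the 95th percentile.

```python
def percentile(values, q):
    """q-th percentile (0-100) of a non-empty list, using linear interpolation."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (position - lower)


def rank_and_saturate(descriptors, reference, q=95.0):
    """Rank users by one reference descriptor and clip every descriptor at its q-th percentile."""
    # Ranking uses the raw reference values, highest first.
    ranked = sorted(descriptors, key=lambda u: descriptors[u][reference], reverse=True)
    names = list(descriptors[ranked[0]]) if ranked else []
    caps = {n: percentile([descriptors[u][n] for u in ranked], q) for n in names}
    # Saturation only affects the values used for plotting, not the order.
    clipped = {u: {n: min(descriptors[u][n], caps[n]) for n in names} for u in ranked}
    return ranked, clipped


# Hypothetical users with two descriptors each.
toy = {
    "u1": {"eigenvalue": 0.9, "followers": 10},
    "u2": {"eigenvalue": 0.1, "followers": 1000},
    "u3": {"eigenvalue": 0.5, "followers": 20},
}
ranked, clipped = rank_and_saturate(toy, reference="eigenvalue")
```

In this toy case the user ranked last by eigenvalue has by far the most followers, and clipping reduces that single extreme value so that the remaining bars stay readable.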
Figure 2.
Mosaic of bar plots for ranked users according to eigenvalue centrality. Descriptors shown are: cfbetweenness, cfcloseness, eigenvalue, followers, following, favorites and status_count. Each bar corresponds to one user.
Figure 4 shows the ranking resulting from using current flow betweenness centrality as the reference. In this case, the distribution of the reference variable is smoother and shows a more gradual behavior of leaders.
Figure 3.
Distribution of the ranked users for each descriptor.
The occurrence of nodes with centrality values very far from the distribution average is an important phenomenon when studying social leaders. It has implications for the distribution of information and misinformation. Each descriptor behaves differently, as can be observed from the comparison between Figure 2 (ranking by eigenvalue) and Figure 4 (ranking by current flow betweenness).
Figure 4.
Mosaic of bar plots for ranked users according to current flow betweenness. Descriptors shown: cfbetweenness, cfcloseness, eigenvalue, followers, following, favorites and status_count. Each bar corresponds to one user.
3.3 Network
To assess how the nodes with high relevance are distributed, we projected the network into graphs by selecting the subgraph of nodes above a certain level of relevance (a threshold on the network). The resulting graphs may therefore not be connected. The eigenvalue-ranked graph shows high connectivity and very big nodes (see Fig. 5). This is consistent with the definition of eigenvalue centrality, which highlights how a node is connected to nodes that are also highly connected. This structure has implications for the reinforcement of specific messages and information within highly connected clusters, which can act as promoters of solutions or may become lobbies of information.
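The projection described above can be sketched as the subgraph induced by the most relevant nodes. Selecting a fixed number of top-ranked nodes (rather than a value threshold) is an assumption suggested by the "size 500" figure captions; the function name and the toy data are hypothetical.

```python
def top_k_subgraph(edges, scores, k):
    """Keep the k highest-scoring nodes and only the edges that join two of them."""
    top_nodes = set(sorted(scores, key=scores.get, reverse=True)[:k])
    induced_edges = [(u, v) for u, v in edges if u in top_nodes and v in top_nodes]
    return top_nodes, induced_edges


# Toy network and hypothetical relevance scores.
toy_edges = [("a", "b"), ("b", "c"), ("c", "d"), ("a", "d")]
toy_scores = {"a": 0.9, "b": 0.1, "c": 0.8, "d": 0.2}
leaders, leader_edges = top_k_subgraph(toy_edges, toy_scores, k=2)
```

In this toy case the two leaders a and c are not directly linked, so the projected graph has no edges even though the full network is connected, which illustrates why the resulting graphs may not be connected.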
Figure 5.
Graph of high-eigenvalue users.
The current flow betweenness graph is unconnected, which is very interesting, as decentralized nodes play a key role in transporting information through the network (see Fig. 6). The current flow closeness graph is also unconnected, which means that the social network is rather homogeneously distributed overall, with parallel communities of information that do not necessarily interact with each other (see Fig. 7).
Figure 6.
Graph of high-current flow betweenness users.
Figure 7.
Graph of high-current flow closeness users.
By increasing the size of the graph, more clusters can be observed, especially in the eigenvalue-ranked network (Fig. 8). Some clusters also appear for the current flow betweenness and current flow closeness (see Figs. 9 and 10). These clusters may have a key role in establishing bridges between different communities of practice, knowledge or region-determined groups. As the edges of the network are characterized in terms of flows between users, these bridges can be understood in terms of volume of information between communities.
Figure 8.
Graph of high-eigenvalue users (size 500).
Figure 9.
Graph of high-current flow betweenness users (size 500).
Figure 10.
Graph of high-current flow closeness users (size 500).
DISCUSSION
The distributions of the centrality metrics indicate that there are some nodes with massive relevance. These nodes can be seen as events within the flow of communication through the network [23] that require further contextualization to be interpreted. These nodes can propagate misinformation or make news or messages viral. Further research is required to understand the cause of these massive relevance events, for instance, whether they are related to a relevant concept or message or whether they are an emerging event of the network dynamics and topology. Another way to assess these nodes is to check whether they behave this way consistently over time or are a temporary event. Also, it may be necessary to consider the type of content they normally spread in order to understand their exceptional relevance. Besides the existence of massive relevance nodes, quantifying and understanding the distribution of highly relevant nodes has many potential applications for spreading messages to a wide number of users within the network. Current flow betweenness in particular seems a good indicator to identify nodes with which to create a safety net in terms of information and positive messages. The distribution of the nodes could be studied for the general network or for different layers or subnetworks, isolated depending on several factors: type of interaction, type of content or some other behavioral pattern. Experimental work is needed to test how a message, either positive or negative, spreads when started at one of the relevant nodes or close to them. For this purpose we are working towards integrating a network of concepts and the network of leaders. Understanding the dynamics of narratives and concept spreading is key for a responsible use of social media to build resilience against crises.
We also plan to build an interactive graph visualization to browse the relevance of the network and dynamically investigate how relevant nodes are connected and how specific parts of the graph are ranked, in order to understand the distribution of the relevance variables, as statistical parameters are not suitable to characterize a common pattern. It is necessary to make a dynamic ethical assessment of the potential applications of this study. Understanding the network can be used for control purposes. However, we consider it necessary that social media become the basis of a pro-active response in terms of conceptual content and information. Digital technologies must play a key role in building resilience and tackling crises.
ACKNOWLEDGEMENTS
We would like to thank the Center of Innovation and Technology for Development at Technical University Madrid for support and valuable input, especially Xose Ramil, Sara Romero and Mónica del Moral. Thanks also to Pedro J. Zufiria, Juan Garbajosa, Alejandro Jarabo and Carlos García-Mauriño for collaboration.
REFERENCES
1. Shu, K.; Sliva, A.; Wang, S.; Tang, J.; Liu, H. Fake news detection on social media: A data mining perspective. ACM SIGKDD Explorations Newsletter, 22-36.
2. Lazer, D.M.; Baum, M.A.; Benkler, Y.; Berinsky, A.J.; Greenhill, K.M.; Menczer, F.; Metzger, M.J.; Nyhan, B.; Pennycook, G.; Rothschild, D. The science of fake news. Science, 1094-1096.
3. Bakir, V.; McStay, A. Fake news and the economy of emotions: Problems, causes, solutions. Digital journalism, 154-175.
4. Allcott, H.; Gentzkow, M. Social media and fake news in the 2016 election. Journal of economic perspectives, 211-236.
5. Peters, M.A.; Jandrić, P.; McLaren, P. Viral modernity? Epidemics, infodemics, and the ‘bioinformational’ paradigm. Taylor & Francis: 2020.
6. Zarocostas, J. How to fight an infodemic. The Lancet, 676.
7. Cinelli, M.; Quattrociocchi, W.; Galeazzi, A.; Valensise, C.M.; Brugnoli, E.; Schmidt, A.L.; Zola, P.; Zollo, F.; Scala, A. The covid-19 social media infodemic. arXiv preprint arXiv:2003.05004.
8. Hua, J.; Shaw, R. Corona virus (Covid-19) “infodemic” and emerging issues through a data lens: The case of china. International journal of environmental research and public health, 2309.
9. Medford, R.J.; Saleh, S.N.; Sumarsono, A.; Perl, T.M.; Lehmann, C.U. An "Infodemic": Leveraging High-Volume Twitter Data to Understand Public Sentiment for the COVID-19 Outbreak. medRxiv.
10. Vaezi, A.; Javanmard, S.H. Infodemic and risk communication in the era of CoV-19. Advanced Biomedical Research.
11. Militello, L.G.; Patterson, E.S.; Bowman, L.; Wears, R. Information flow during crisis management: challenges to coordination in the emergency operations center. Cognition, Technology & Work, 25-31.
12. Greenwood, F.; Howarth, C.; Escudero Poole, D.; Raymond, N.A.; Scarnecchia, D.P. The signal code: A human rights approach to information during crisis. Harvard, MA.
13. Gao, L.; Song, C.; Gao, Z.; Barabási, A.-L.; Bagrow, J.P.; Wang, D. Quantifying information flow during emergencies. Scientific reports, 3997.
14. Morales, A.; Borondo, J.; Losada, J.C.; Benito, R.M. Measuring political polarization: Twitter shows the two sides of Venezuela. Chaos: An Interdisciplinary Journal of Nonlinear Science, 033114.
15. Pierri, F.; Ceri, S. False news on social media: A data-driven survey. ACM SIGMOD Record, 18-27.
16. MacAvaney, S.; Yao, H.-R.; Yang, E.; Russell, K.; Goharian, N.; Frieder, O. Hate speech detection: Challenges and solutions. PloS one.
17. Ghanem, B.; Rosso, P.; Rangel, F. An emotional analysis of false information in social media and news articles. ACM Transactions on Internet Technology (TOIT), 1-18.
18. Popat, K.; Mukherjee, S.; Yates, A.; Weikum, G. DeClarE: Debunking fake news and false claims using evidence-aware deep learning. arXiv preprint arXiv:1809.06416.
19. Ruchansky, N.; Seo, S.; Liu, Y. Csi: A hybrid deep model for fake news detection. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management; pp. 797-806.
20. Singhania, S.; Fernandez, N.; Rao, S. 3han: A deep neural network for fake news detection. In Proceedings of International Conference on Neural Information Processing; pp. 572-581.
21. Miritello, G.; Moro, E.; Lara, R. Dynamical strength of social ties in information spreading. Physical Review E, 045102.
22. Iribarren, J.L.; Moro, E. Impact of human activity patterns on the dynamics of information diffusion. Physical review letters, 038702.
23. Morales, A.J.; Borondo, J.; Losada, J.C.; Benito, R.M. Efficiency of human activity on information spreading on Twitter. Social Networks, 1-11.
24. Borondo, J.; Morales, A.; Benito, R.; Losada, J. Multiple leaders on a multilayer social media. Chaos, Solitons & Fractals, 90-98.
25. Balkundi, P.; Kilduff, M. The ties that lead: A social network approach to leadership. The leadership quarterly, 419-439.
26. Bodendorf, F.; Kaiser, C. Detecting opinion leaders and trends in online social networks. In Proceedings of the 2nd ACM workshop on Social web search and mining; pp. 65-68.
27. De Brún, A.; McAuliffe, E. Exploring the potential for collective leadership in a newly established hospital network. Journal of Health Organization and Management.
28. Fransen, K.; Van Puyenbroeck, S.; Loughead, T.M.; Vanbeselaere, N.; De Cuyper, B.; Broek, G.V.; Boen, F. Who takes the lead? Social network analysis as a pioneering tool to investigate shared leadership within sports teams. Social networks, 28-38.
29. Goyal, A.; Bonchi, F.; Lakshmanan, L.V. Discovering leaders from community actions. In Proceedings of the 17th ACM conference on Information and knowledge management; pp. 499-508.
30. Iakhnis, E.; Badawy, A. Networks of Power: Analyzing World Leaders Interactions on Social Media. arXiv preprint arXiv:1907.11283, 2019.