A comparative study of Bot Detection techniques methods with an application related to Covid-19 discourse on Twitter
Marzia Antenore, Jose M. Camacho-Rodriguez, Emanuele Panizzi
Department of Communication and Social Research, Sapienza University of Rome
Department of Computer Science, Sapienza University of Rome
Abstract
Bot detection is an essential asset in a period where Online Social Networks (OSN) are part of our lives. This task becomes even more relevant in crises, such as the Covid-19 pandemic, where there is an incipient risk of proliferation of social bots, a possible source of misinformation. To address this issue, we compared different methods for automatically detecting social bots on Twitter using data selection. The techniques used to build the bot detection models include features based on the tweets' metadata and on the Digital Fingerprint of Twitter accounts. In addition, we analysed the presence of bots in tweets from different periods of the first months of the Covid-19 pandemic, using the bot detection technique that best fits the scope of the task. This work also includes analyses of aspects of the discourse of bots and humans, such as sentiment and hashtag usage.
According to [47], a bot is a socio-technical entity based on a software program whose aim is to simulate human behavior in Online Social Networks (OSN) such as Facebook, Twitter, or Instagram. Bots are configured to pass as humans not just to other human users, but also to the OSN platform [28]. Through different methods, such as Artificial Intelligence (AI), bots interpret the situation and react accordingly [28]. These entities can cause malicious effects, such as changing the online practices of human users in Social Networks [47], producing a detrimental impact on politics. There is proof that social bots are crucial in the propagation of fake news and misinformation [26][45][42][66]. Moreover, as bots become better at simulating human behavior, the line between a human user and this socio-technical entity becomes less clear [28], causing concern about the participation of bots in political events because of the negative effect on the quality of democracy [63]. This fact has motivated the development of many bot detection techniques during the last few years [27], not always successful in completely solving the problem [28].

This work focuses on Twitter. Some studies estimated that around 15% of the accounts on Twitter operate automatically or semi-automatically [44]. One reason that might have stimulated the rise in the number of bots is the characteristics of Twitter itself [28]. Moreover, it is worth mentioning that a bot on Twitter can be regarded as a credible source of information [40]. In addition, bot-operated accounts can be 2.5 times more influential than human-operated accounts [69]. These two facts, combined with the capacity of bots to impersonate humans, might produce events that impact politics, negatively influencing public opinion and thus drastically affecting democratic processes [31].
In particular, a significant number of bots have been used as fake followers of politicians to generate a false impression of popularity [45] or utilized by regimes to spread propaganda [51]. Other studies show that social bots influenced discourse on social media during the Brexit referendum [30], the 2017 French presidential election campaign [42], the 2016 US presidential election [51], and the 2014 Venezuelan protests [45]. Other research also shows that bots influenced the public discourse on climate change [70].

This research is developed in the context of the Covid-19 pandemic, a situation which has resulted in social and economic disruption, apart from the worst economic downturn since the Great Depression [12]. In addition, work, public events, sports, conferences, and the education system have been greatly affected by social distancing measures that forced people out of their daily routines and face-to-face interactions. Social networks such as Twitter have become fundamental to allow people to stay connected and to share information, opinions, and reactions around Covid-19. As social interaction moves more and more online, it becomes crucial to study the activity of automated accounts that could alter the public debate on central issues such as government policy, public health, and individual decision-making in an undesirable fashion. Furthermore, many studies show that bot accounts play a crucial role in the spread of misinformation on Twitter [12]. As a consequence, spotting the bots is the first step towards implementing measures to protect the quality of democratic processes.

At the time of this writing, there are already many studies that have analyzed the public discourse on the Covid-19 pandemic on social network sites [33]. Some of them looked at emotional and sentiment dynamics in social media conversation around pandemic-related topics [53][46].
Others have focused primarily on bot account detection, aiming to describe their behavior, in contrast with human activity, and their focal topics of discussion [43].

In this work, we provide the following contributions. First and foremost, we compare supervised bot detection methods from the literature, using the metadata of a Twitter account as well as information extracted from the Social Fingerprint of the accounts using compression statistics. These methods have been implemented using the data selection technique, in which a subset of training data is sought that provides a consistent model with the best balance between cross-validation and cross-domain generalization [65]. The methods implemented are compared with Botometer v3, which was available until September 2020 and was used in several studies [67]. In addition, we analysed the presence of bots in tweets from different periods of the first months of the Covid-19 pandemic, using the bot detection technique that best fits the scope of the task. Moreover, this work also includes analyses of other aspects, such as the distribution of bots and differences in discourse between bots and humans based on sentiment and hashtags.

Roadmap. In Chapter 2, we comment on the literature reviewed to develop this work and summarize its contributions. In
Chapter 3, we compare the approaches presented in [75] and [65], implementing a data selection technique for both of them and using several classification algorithms. Moreover, the bot and human accounts are depicted using some of the features computed for prediction. Finally, the models implemented are compared with Botometer version 3. In
Chapter 4, we analyze the presence of bots in specific periods of the first months of the pandemic. Then, we study differences in sentiment between bots and humans during the periods studied. In
Chapter 5, we discuss some points about the research and draw some conclusions.
Political manipulation by social bots has occurred worldwide, provoking increasing interest in bot detection over the last decade [34]. During this time, both supervised and unsupervised techniques have been implemented to address this task. Unsupervised methods are more robust than supervised ones because they do not rely on ground truth quality. Research in [52] introduces CATCHSYN, an unsupervised bot detection algorithm based on a graph mining approach. This technique captures bots through measures of normality and synchronicity, which allows detecting rare and synchronized behaviors. The advantages of this algorithm are scalability and no need for parameters or labeled data. CATCHSYN presents linear complexity with respect to the graph size and only makes use of topology features for detection. The research in [62] also presents an unsupervised method. It uses features extracted from the retweeting patterns of the accounts. These features are used with a clustering algorithm to distinguish between bots and humans. Besides, they introduce RTT plots, an informative visualization to observe suspicious behaviors in the retweeting patterns of Twitter accounts. These visualizations need less information than others proposed in the literature, such as [52] and [48].

Supervised methods, though they might have generalization issues, are extensively used for bot detection [34]. In [72], a supervised method is presented with more than 1000 features related to user metadata, friends, network, temporal information, content, and sentiment. This research resulted in the first version of Botometer, a bot detection service available online. [74] presents an update of that version. This update added new features to the model and included new training datasets containing other types of bots. In this way, the researchers were able to cope, at least temporarily, with the paradigm shift of bots [35] and the effort of bot developers to evade detection techniques [34].
This improvement corresponded to the third version of Botometer, available through its API until the end of August 2020. This version, widely used through its API [74], was included in several studies [67] and considered a state-of-the-art method for bot detection [74]. We use this tool in part of our experiments. Later, [71] introduced Botometer version 4. This research proposes an Ensemble of Specialized Classifiers. The approach consists of generating specific models for bot-operated accounts with different behaviors and then combining them through an ensemble and a voting system. It aims to deal with the performance decrease observed when the training data contain accounts with different behaviors. This alternative avoids retraining the model with a vast amount of data, which would be costly. Another problem that supervised methods may face is the lack of labeled data. [56] presents a way to deal with this possible lack of data. This research uses data generation to create training data to feed a model that combines tweets' metadata with their content through an LSTM neural network. Using language-related features may reduce performance when the models evaluate accounts interacting in other languages. The models in [54] and [61] address this issue, focusing on building language-independent models. The model in [54] uses the tweets' metadata to determine whether an account is a bot or a human. The research in [61] also introduces a language-independent method, which uses expressive account-based and content-based features. Other setbacks that supervised models can face are interpretability and noisy training data. Interpretability is an issue in ML algorithms, which may fall into the black-box metaphor, not letting humans understand the intermediate processes between an input and an output. The study in [59] approaches this issue by extracting features with the contrast-pattern technique on aspects of the accounts such as usage, information, content sentiment, or tweet content.
Through this method, the model implemented is interpretable, enabling humans to understand why an account is classified as bot or human. Noise in training data is a problem that may reduce the performance of a bot detector. [75] uses a data selection technique to tackle this. The technique consists of choosing a subset of training data to optimize the performance of the model. It is an excellent method to maximize the existing available resources while giving optimal results. Besides, this research presents a scalable classifier with 20 features. Scalability is essential when analyzing OSN because of the high volume of data. For our experiments, we make use of this method. Research in [64] also introduces a scalable supervised model. It utilizes partial information about an account and its corresponding tweet history to detect content polluters in real time.

As previously mentioned, bot detection is an evolving field because as soon as a new method appears, malicious bot developers work to beat it. Intending to detect the evolving trend of bots exposed in [35], research in [36] introduces the Social Fingerprinting technique. Social Fingerprinting models the online behavior of an account using the Digital DNA. Digital DNA is a string that encodes the different types of account interactions. Research in [36] presents how to exploit Social Fingerprinting in both a supervised and unsupervised fashion using the Longest Common Substring (LCS) as a similarity measure between DNA strings. [38] utilizes the former method for a bot detection analysis over stock microblogs on Twitter. [55] and [65] present supervised models that use Digital DNA. [55] employs statistical measures of text richness and diversity to extract features from the Digital DNA. [65] applies a lossless compression algorithm to the DNA string to obtain compression statistics as features. These features allow separating bot accounts and human-operated accounts, even permitting visualization of the division.
Part of our work aims to combine this method with the data selection technique to build a robust method to detect bots across several domains.

Existing literature has studied bot presence during the Covid-19 pandemic, such as [43]. That study described and compared the behavior and discussion topics of bots and humans. Alternatively, other works analyzed the discourse during the Covid-19 pandemic on Online Social Networks (OSN). For instance, [53] and [46] studied emotional and sentiment dynamics in social media conversation around pandemic-related topics.
In this section, all the details about bot detection are explained. First, we describe how the features for bot detection were obtained and the different sets of features used. Then, the datasets used for training and testing are presented. Moreover, the accounts from all the datasets are represented with respect to a subset of the features computed for bot detection. Finally, we compare the results of the different models implemented using a data selection technique with those of Botometer.
The features that we use for the bot detection model can be split into two groups: those obtained and derived from the metadata of each account, and the variables obtained through the Social Fingerprint technique using compression statistics.

The first approach consists of using as detection features the metadata of each account and new variables derived from the raw metadata. The metadata is retrieved from the User Object related to each account. The features retrieved directly from the User Object are:

• statuses count: number of tweets posted, including retweets.
• followers count: number of followers.
• friends count: number of accounts followed.
• favourites count: number of tweets liked by the account.
• listed count: number of public lists in which the account is involved.
• default profile: boolean indicating if the profile's theme or background has been altered.
• verified: boolean indicating that the user has a verified account.

To compute some derived features from the metadata, the variable user age is used. user age corresponds to the difference in hours between the creation time of the last accessible tweet (probe time) and the creation time of the user [75]. The features derived from the metadata of the User Object are:

• screen name length: length of the screen name string.
• num digits in screen name: number of digits in the screen name string.
• name length: length of the name string.
• num digits in name: number of digits in the name.
• description length: length of the description string.
• friend growth rate: friends count / user age
• listed growth rate: listed count / user age
• favourites growth rate: favourites count / user age
• tweet freq: statuses count / user age
• followers growth rate: followers count / user age
• followers friend ratio: followers count / friends count
• screen name likelihood: the geometric mean of the likelihood of all bigrams in a screen name.
More than 2 million unique screen names from random Twitter accounts were retrieved to compute the likelihood of each one of the 3969 bigrams which can be created using the characters allowed in a screen name (upper- and lower-case letters, digits, and underscore). The intuition behind screen name likelihood is that the screen names of bot-operated accounts are sometimes constituted by a random string [75], this being a distinctive characteristic with respect to humans.

The second approach, Social Fingerprinting, is a technique that consists of modeling the behaviour of an account through the Digital DNA, which is a string of characters based on the sequence of actions of a Twitter account. This string is produced by encoding the behaviour through a mapping between the types of interactions and characters, or bases, producing a DNA string. These bases form a set of unique characters called the alphabet. The alphabet is used to generate a sequence, represented by a row vector or string, which encodes a user's behaviour [36]. More formally, an alphabet B is defined as [65]

B = {B_1, B_2, ..., B_N},  B_i ≠ B_j  ∀ i, j = 1, ..., N ∧ i ≠ j

which is used to generate a sequence whose expression is

s = (b_1, b_2, ..., b_n) = b_1 b_2 ... b_n,  b_i ∈ B  ∀ i = 1, ..., n

For our experiments, the following alphabet is used to encode a Twitter user's behaviour:

B_type = {A ← tweet, C ← reply, T ← retweet} = {A, C, T}

The behaviour of a Twitter account is captured through its timeline and used to generate the DNA sequence. For instance, if an account x first did a retweet, then two tweets, and finally a retweet, its sequence using B_type is TAAT. From here, it follows that the length of the sequence depends on the number of tweets considered. In our case, we retrieved the maximum possible number of tweets (including retweets and replies) for each account, with the 3200 most recent tweets as a limit because of Twitter API restrictions [21].
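As an illustration, the encoding above can be sketched in a few lines. The mapping dictionary and helper names below are ours, and the compression step anticipates the statistics described next; zlib stands in for the unspecified lossless compression algorithm.

```python
import zlib

# Hypothetical sketch of the Digital DNA encoding with alphabet B_type:
# tweet -> 'A', reply -> 'C', retweet -> 'T'.
BASES = {"tweet": "A", "reply": "C", "retweet": "T"}

def dna_sequence(timeline):
    """Encode a timeline (list of interaction types, oldest to newest)
    into a Digital DNA string."""
    return "".join(BASES[action] for action in timeline)

def compression_features(dna):
    """Compute the three compression statistics used as features:
    original size, compressed size, and compression ratio."""
    raw = dna.encode("ascii")
    original_size = len(raw)
    compressed_size = len(zlib.compress(raw))
    return original_size, compressed_size, original_size / compressed_size

# The example from the text: a retweet, two tweets, then a retweet.
timeline = ["retweet", "tweet", "tweet", "retweet"]
print(dna_sequence(timeline))  # TAAT
```

A longer, repetitive timeline would compress well and yield a high compression ratio, which is what the classifiers exploit.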
The accounts which are protected or do not possess any timeline cannot be analysed with this methodology.

The DNA sequences generated from the timelines are compressed using a lossless compression algorithm. Then, we compute the following features: original size of the DNA string, compressed size of the DNA string, and compression ratio (original DNA size / compressed DNA size). For our experiments we use the sets of features listed below:

• The features extracted and derived from the User Object previously introduced. This set of features is denoted as Light.
• The original size of the DNA string and the compressed size of the DNA string. This set of features is referred to as A.
• The original size of the DNA string and the compression ratio. This set is denoted as B.
• The compressed size of the DNA string and the compression ratio. This set is referred to as C.
• The original size of the DNA string, the compressed size of the DNA string, and the compression ratio. This set is denoted as D.

The set Light corresponds to the features used for bot detection in [75], with the exception that the feature profile use background image is not included, since it has been deprecated from the Twitter API [23]. This set of features allows implementing a scalable bot detection technique, since each tweet retrieved with the Twitter API (versions 1.1 and Gnip 2.0) [39] contains the User Object of the corresponding account, with no need to obtain extra data [75]. However, this sort of approach can be vulnerable to adversarial attacks [34]. The sets of features A, B, C, and D are based on the research in [65]. This technique provides a detection model which is more resistant to adversarial attacks [65], but scales worse.

In this section, the datasets used for the implementation of the bot detection model are presented. Following the procedure in [75], we used some datasets for training, while other datasets were set aside for testing. In this way, we expect to build a bot detection model that not only performs properly in cross-validation on the data used for training, but also generalises well when it is used on accounts displaying new behaviours, obtaining cross-domain validation. Most of the datasets have been obtained from https://botometer.iuni.iu.edu/bot-repository or from other public repositories online.

The datasets used for training are:

• Caverlee: To form this dataset, honeypot accounts were used to attract bot-operated accounts, mainly consisting of spammers, malicious promoters, and friend infiltrators.
This dataset was presented in [58].
• Cresci-17: The dataset was constructed using human annotators, labeled accounts from other datasets, and bot accounts purchased in online markets. The bots in this dataset include retweet spammers for political campaigns, hashtag spammers, URL spammers, job-promoting bots, fake followers, and URL scammers. The dataset is used in [35].
• Varol: The dataset was built by manually annotating several accounts from different deciles of Botometer scores. It was first used in [72].
• Pronbots: The dataset was first shared on GitHub by a researcher in May 2018. The bots are Twitter accounts advertising scam sites. It was used in [74].
• Political: It consists of politics-oriented bots that were shared by Twitter user @john emerson. It was extracted from [74].
• Botometer-feedback: It is made of accounts which were annotated manually after being reported by Botometer users. It is used in [74].
• Vendor-purchased: It is uniquely composed of bots that play the role of fake followers. These accounts were bought by researchers from several companies. This dataset is used in [74].
• Celebrity: This dataset, composed solely of human accounts, was extracted from [75]. It was created by selecting the Twitter accounts of celebrities.
Training dataset     User Object        Social Fingerprinting
                     Human     Bot      Human     Bot
botometer feed         347     108        337     107
varol                 1525     690       1331     668
political                0      13          0      13
cresci 17             2907    5925       2440    5607
celebrity             5814       0       5763       0
vendor                   0     731          0     718
pronbots                 0    1899          0    1723
caverlee             15211   14619      12824   14156

Table 1: Number of bot and human accounts in each training dataset. The number of accounts is displayed for the cases where we use the features from the User Object and from the Social Fingerprint.

The datasets used for test are:

• Botwiki: This dataset consists of 704 bot-operated accounts. It is formed from active Twitter bots listed on botwiki.org. On this website, internet users can find an archive of self-identified bots. It is utilised in the research conducted in [75].
• Verified: It is composed of verified human user accounts extracted through the Twitter streaming API. It is utilised in [75].
• Rtbust: The dataset was created by manually annotating retweets retrieved from the last 12 days of June 2018. It was extracted from [62].
• Stock: The bot-operated accounts were detected through similarities in the timelines of accounts containing tweets with specific cashtags over a five-month period in 2017. The study through which these bot-operated accounts were detected, along with details about the accounts, can be found in [37] and [38]. The bots in this dataset present a coordinated behaviour.
• Gilani: The dataset was formed by retrieving accounts with the Twitter Streaming API and splitting them into four groups according to their number of followers. Then, accounts from each group were extracted and annotated manually. The dataset was used in [49].
• Midterm: The dataset is composed of accounts that interacted during the 2018 U.S. midterm elections. The accounts were manually annotated as bot or human through the correlation between the tweeting timestamp and the creation timestamp. The dataset is utilised in the research conducted in [75].
• Kaiser: The accounts labeled as human correspond to those belonging to American and German politicians, under the assumption that all are human-operated. On the other hand, the bot-operated accounts were manually annotated in the case of German accounts and extracted from botwiki.org in the case of English bots. This dataset was used in [67].

The botwiki and verified datasets are considered together during the test as botwiki-verified. It is worth mentioning that the datasets used for training are the same as in [75], whilst for testing, the datasets stock and kaiser are added to those already used in [75]. By including two more datasets for testing, we want to test the models on other bots of a different nature.

Table 1 and Table 2 display the number of bot and human accounts which constitute each dataset for training and testing. The tables are divided between User Object and Social Fingerprinting because, as mentioned before, it is not possible to use DNA methods with accounts which are protected or do not have a timeline. Even though there are differences in the number of accounts in most of the datasets, these differences are thought not to be big enough to be misleading when the User Object and Social Fingerprint approaches are compared.
Test dataset         User Object        Social Fingerprinting
                     Human     Bot      Human     Bot
Rtbust                 332     321        317     314
Gilani                1418    1043       1293     997
Kaiser                1007     290        959     232
Botwiki-verified      1985     685       1974     610
Midterm               7416      37       7290      32
Stock                 6132    6964       5333    6246

Table 2: Number of bot and human accounts in each test dataset. The number of accounts is displayed for the cases where we use the features from the User Object and from the Social Fingerprinting.
Following the approach of [65], we elaborate 2-D scatterplots representing the accounts in the datasets used in our work through the compression statistics. Figure 1 displays all the datasets used for training represented by the three combinations of compression statistics. Figure 2 conveys the same for each one of the test datasets.

Figure 1: Scatterplot representing accounts in the train datasets through compression statistics.

Figure 2: Scatterplot representing accounts in the test datasets through compression statistics.

These plots intend to show that these features are not just useful to separate humans from bots in a specific dataset, but can be generalised to more cases. In fact, in most of the datasets, a division between the bot and human-operated accounts is observed. Besides, it is worth mentioning the case of the stock dataset. In this dataset, the bots have a coordinated nature that makes feature-based classifiers ill-suited to detect them [75]. However, looking at the scatterplot, it seems that the compression statistics manage to separate both types of accounts. These plots can give us hints about the predictive power of models using these features for detection.
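The intuition behind this separation can be reproduced in miniature with synthetic sequences. The example below is ours (with zlib standing in for the lossless compressor) and not the paper's data: a rigidly repetitive, bot-like DNA string compresses far better than an irregular, human-like one.

```python
import random
import zlib

def compression_ratio(dna):
    """Original size / compressed size of a Digital DNA string."""
    raw = dna.encode("ascii")
    return len(raw) / len(zlib.compress(raw))

random.seed(42)

# A coordinated, spam-like account repeats the same action pattern,
# while a human timeline mixes tweets (A), replies (C) and retweets (T)
# irregularly. Repetitive strings compress much better, so the bot-like
# sequence shows a markedly higher compression ratio.
bot_like = "TAT" * 500                                    # rigid pattern
human_like = "".join(random.choice("ACT") for _ in range(1500))

print(compression_ratio(bot_like) > compression_ratio(human_like))  # True
```

Plotting such ratios against the original string sizes yields scatterplots of the kind shown in Figures 1 and 2.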
According to learning theory, using as much data as possible to train a model will provide the best models if the following conditions are met [75]:

• The labels of the training data are correct.
• The data considered are independent and identically distributed in the feature space.

In case these conditions are not met, a data selection method can be employed. This method aims to find a subset of training data that optimises the cross-validation performance on the training data and the ability to generalize to unseen data [75]. Data selection techniques have shown satisfactory results in different domains with noisy data and contradictory labels [73][41][76]. The data selection technique is applied to the training data. Specifically, all the different combinations of training datasets are used, which amounts to 247 different combinations. Then, for each combination of datasets, each one of the feature sets in Section 3.1 is used with the following classification algorithms:
Logistic Regression, AdaBoost, Support Vector Machine with linear kernel, Random Forest, Gradient Boosting, K Nearest Neighbors (KNN), Naive Bayes, and Multilayer Perceptron (MLP). Each possible combination is evaluated on all the test datasets using the AUC score. By using several classification algorithms, we intend to perform a more intensive search than in [75] to find the best performing model, not just using combinations of datasets but also adding classification algorithms to the equation.

The MLP is composed of one hidden layer in the case of the feature sets A, B, C (120 neurons) and D (150 neurons), and two hidden layers in the case of Light (300 and 200 neurons). We use the default hyperparameters of the library sklearn for the other algorithms.

For the rest of the section, we will denote a model as a vector of the form (x, y, z); x ∈ X, y ∈ Y, z ∈ Z, where X corresponds to the set composed of the 247 possible combinations of training datasets, Y is the set formed by all the classification algorithms, and Z is the set formed by the feature sets Light, A, B, C, D. We created 9880 different models, based on 247 train dataset combinations, 8 algorithms, and 5 sets of features. Through our heuristic process, we selected 5 of them, i.e. the best model for each set of features. The process is the following:

1. We group the models by feature set (obtaining 5 groups), and in each group we validate each of the 247 × 8 = 1976 models.
Feature set   Model                 Training datasets                             AUC scores (Rtbust, Gilani, Kaiser, Botwiki-verified, Midterm, Stock, 5-fold)
Light         Gradient Boosting     botometer feed, varol, cresci 17, celebrity   0.613, 0.631
C             Random Forest         political, cresci 17                          0.660, 0.691, 0.927, 0.980, 0.944, 0.863
D             Logistic Regression   botometer feed, cresci 17                     0.699, 0.719
Table 3: Best model for each set of features with their 5-fold cross-validation and their performance on each test set.

In Table 3, the best models according to our heuristic for each set of features are shown, along with the AUC score of the models on each test dataset and in 5-fold cross-validation. We observe that the models with the features obtained through the Social Fingerprint outperform or obtain similar results to the
Light model in all cases. The stock dataset is where the DNA models most clearly outperform the Light model, with the set of features D obtaining the best result. This is because the bots in the stock dataset show a coordinated behaviour that makes a feature-based model such as Light ill-suited for their detection [75], while the evidence shows that the Social Fingerprint together with compression statistics is an effective method to detect bots with a coordinated behaviour. Besides, we observe that the data selection technique is efficacious, since none of the best models for each set of features used all the train datasets.
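A minimal sketch of this data selection sweep, using made-up dataset names and synthetic features rather than the actual training data, could look as follows (a single Random Forest stands in for the full set of algorithms):

```python
from itertools import combinations

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# Hypothetical sketch: every non-empty combination of training datasets
# is tried, and the combination whose model scores the best AUC on a
# held-out test set is kept.
rng = np.random.default_rng(0)

def fake_dataset(shift, n=200):
    """Synthetic (features, labels) stand-in for one labeled dataset:
    humans around the origin, bots shifted by `shift`."""
    X = np.vstack([rng.normal(0, 1, (n, 3)), rng.normal(shift, 1, (n, 3))])
    y = np.array([0] * n + [1] * n)
    return X, y

train_sets = {name: fake_dataset(s) for name, s in
              [("caverlee", 2.0), ("cresci_17", 1.5), ("varol", 0.2)]}
X_test, y_test = fake_dataset(1.8)

best_auc, best_combo = 0.0, None
for r in range(1, len(train_sets) + 1):
    for combo in combinations(train_sets, r):
        X = np.vstack([train_sets[n][0] for n in combo])
        y = np.concatenate([train_sets[n][1] for n in combo])
        clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
        auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
        if auc > best_auc:
            best_auc, best_combo = auc, combo

print(best_combo, round(best_auc, 3))
```

In the real experiments this inner loop runs over 247 dataset combinations, 8 algorithms, and 5 feature sets, and the winning combination rarely includes every dataset, which is exactly the point of the technique.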
We made a performance comparison of the best models with the sets of features
Light and D with Botometer. Botometer is an online tool for bot detection. For the experiments, Botometer version 3 was used, which was available until the end of August 2020 through its API. Botometer version 3 has been used in several studies in the literature and has even been regarded as the state-of-the-art tool for the detection of bots on Twitter [71]. It is a supervised model; specifically, it uses a Random Forest as classification algorithm. Botometer v3 uses more than 1000 features from each account, related to different fields such as the content of the tweets, their sentiment, the network of the account, or the user metadata [72]. This model has been trained on the following datasets: caverlee, varol, cresci-17, pronbots, vendor, botometer-feed, celebrity, and political [72].

The three models present some significant differences. Both Botometer v3 and the Light model use features extracted from the account, whereas the model with D needs to construct the Digital DNA from the timeline of an account for prediction. Another difference is the number of features each model uses to classify an account. While Botometer v3 uses more than 1000 features, the model with Light utilises 19 features and D uses 3. However, the main difference between the models comes with scalability: while the model with Light allows analysing accounts at the same pace at which the tweets are retrieved, the other models need to cope with Twitter API rate limits, since they need to retrieve the timeline of each account for classification, making them not scalable for the Twitter stream. In this experiment, apart from the AUC score, the following metrics are used to measure the performance of each model: F1, Accuracy, Recall, Precision, and Specificity. To compute these metrics it is necessary to set a classification threshold. In the case of Botometer v3, following [60], 0.3 is used as the threshold to separate humans from bots.
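For reference, the metrics above can be computed from bot scores at a fixed threshold as in this sketch; the helper name and the example scores are ours, not the paper's data.

```python
import numpy as np

# Evaluation metrics computed from bot scores at a fixed classification
# threshold (0.3, as used for Botometer v3 in the text).
def evaluate(y_true, bot_scores, threshold=0.3):
    y_pred = (np.asarray(bot_scores) >= threshold).astype(int)
    y_true = np.asarray(y_true)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0        # sensitivity
    specificity = tn / (tn + fp) if tn + fp else 0.0   # true negative rate
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(y_true)
    return {"f1": f1, "accuracy": accuracy, "recall": recall,
            "precision": precision, "specificity": specificity}

# Illustrative labels (1 = bot) and bot scores.
print(evaluate([1, 0, 1, 0], [0.9, 0.1, 0.4, 0.35]))
```

For the Light and D models, the same function would simply be called with the threshold that maximizes F1 instead of 0.3.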
That is to say, if the probability of an account being a bot is greater than 0.3, it is classified as a bot. This probability will also be referred to as the bot score. For the models with the feature sets D and Light, as done in [75], the threshold used is the bot score that maximizes the F1 metric, balancing precision and recall simultaneously.

Table 4 reports the performance of the three models. We observe that the model with the feature set D performs consistently well overall, outperforming or matching the other two models. The good performance of the model with D on the stock dataset, where it performs best, is worth noting: it gives evidence that the compression statistics extracted from the Digital DNA can detect bots that behave in a coordinated way, as happens in stock. Moreover, combining D with data selection makes it possible to build a classifier that generalises properly across different domains.

Table 4: Comparison in performance of Botometer v3 and the best models with the feature sets Light and D, reporting AUC, F1, Accuracy, Recall, Precision, and Specificity on the Botwiki-verified, Gilani, Kaiser, Midterm, Rtbust, and Stock test datasets.

Alternatively, the model with Light, except for the stock dataset, produces results similar to the other models, on some occasions outperforming them. Besides, it shows the best specificity in all cases, and it is scalable. As expected, the model with
Light does not perform properly on stock because of the coordinated behaviour of the accounts [75]. In contrast, Botometer seems to be more robust against the bots in stock, probably because its features cover more aspects apart from the user metadata. Results also confirm that it is possible to obtain competitive performance using just a small set of features, as in the models with
Light and D, rather than a much larger one as in Botometer.

Many studies suggest that bots manipulate public debate. This behaviour would be particularly dangerous in the context of a global health emergency. We then posit a main research question:
To what extent do bots try to push disruptive actions during the Covid-19 pandemic, in general and in relation to specific topics?
More specifically: What is their prevalence and volume of posting activity compared to that of human accounts? Do they exhibit any difference in the sentiment of the posts they share compared to the ones shared by humans?
To answer these questions, we study the bot presence on specific topics during periods of the first months of the pandemic. Then, after the bot detection analysis, we present the differences in the discourse between humans and bots, focusing on sentiment and hashtags. Through sentiment analysis we estimate the public opinion on certain topics and track COVID-19-related exposure to negative content in online social systems caused by bot activities.

As regards procedure, we used hashtags to identify the tweets related to the same topic. We consider that two tweets belong to the same topic if they contain the same hashtags or a subvariant of them. For instance, tweets with the hashtags COVID19, covid, Covid19, and CovidPandemic belong to the topic COVID.

The tweets used for the experiments in this section were extracted from public datasets in [33][57][29] or Kaggle datasets. These datasets were built by extracting tweets through the Twitter Streaming API. The extracted tweets contain specific hashtags or keywords (with their variants) related to COVID-19, or belong to specific accounts such as the World Health Organization (WHO). Even though most of the datasets contained tweets in several languages, they are mostly composed of English tweets, since the hashtags or keywords used to extract the tweets refer to English terms. This implies that the tweets are mostly related to events in English-speaking countries such as the U.S. or the U.K. These datasets, due to Twitter regulations, contain only the IDs of the tweets. Therefore, it was necessary to hydrate those IDs using the twarc library [7] to obtain the full tweet objects. We only consider English tweets for our experiments.

The topics and periods that we consider in our experiments are listed below:

• Topic WUHAN on 25th and 26th January 2020.
• Topic OUTBREAK on 25th and 26th January 2020.
• Topic COVID on 28th and 29th March 2020.
• Topic LOCKDOWN on 10th May 2020.
• Topic TRUMP from 4th February to 21st February 2020.

As studies suggest that social media discourse mirrors offline event dynamics, these topics and periods were studied because they were considered prone to the presence of bots, as they reflect controversial issues in people's conversations. WUHAN and OUTBREAK refer to the beginning of the pandemic, when the virus had rapidly spread in China and received names such as "Wuhan virus" or "Wuhan coronavirus". In this context, authorities canceled large-scale events such as the Spring Festival, and there were traveling restrictions for more than 30 million people. These facts constituted an unprecedented event [13]. Moreover, 15 Chinese cities suffered partial or full lockdowns in an attempt to limit the spread of the coronavirus [10].

Topic      Accounts  Tweets
OUTBREAK   64602     82030
WUHAN      103916    163723
COVID      312034    414097
LOCKDOWN   26813     31052
TRUMP      10144     26865

Table 5: Number of accounts and tweets for each one of the cases studied.

The COVID topic on 28th and 29th March coincides with Trump considering quarantining New York [5], as there was a shortage of equipment for health workers and hospitals were overloaded [15][16]. Moreover, the milestone of 2000 deaths in the US was passed in these days [15].

In the scope of LOCKDOWN on 10th May, there was high criticism of the first steps out of the lockdown proposed by the UK Prime Minister, Boris Johnson [3].

Finally, the TRUMP case refers to the management of the start of the pandemic by President Trump, which was highly criticized. In this period, there were problems with COVID testing in the U.S. [22], making it difficult to stop the spread of the virus. Besides, little attention was given to the coronavirus in the State of the Union address on 4th February, where President Trump spent less than 30 seconds referring to the COVID-19 situation [14].
Moreover, during this time, the US government had to manage the Diamond Princess cruise situation, where the conditions of the Americans on the ship during February were criticized [24].

Table 5 displays the number of unique tweets and accounts considered for each topic after hydrating the tweets. We use these tweets for our experiments.
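The topic-grouping rule described above (two tweets belong to the same topic if they contain the same hashtags or a subvariant of them) can be sketched as follows. The variant lists and the `topics_of` helper are illustrative assumptions, not the exact implementation used in the study:

```python
import re

# Illustrative variant lists; the paper groups tweets whose hashtags are
# variants of the same term (e.g. COVID19, covid, Covid19, CovidPandemic).
TOPIC_VARIANTS = {
    "COVID": {"covid19", "covid", "covidpandemic"},
    "LOCKDOWN": {"lockdown", "lockdown2020"},
}

def normalize(hashtag: str) -> str:
    """Strip the leading '#' and lowercase, so case variants compare equal."""
    return re.sub(r"^#", "", hashtag).lower()

def topics_of(tweet_hashtags):
    """Return the set of topics a tweet belongs to, given its hashtags."""
    tags = {normalize(h) for h in tweet_hashtags}
    return {topic for topic, variants in TOPIC_VARIANTS.items() if tags & variants}

print(topics_of(["#Covid19", "#StayHome"]))  # {'COVID'}
```

A tweet matching variants of several topics would be counted in each of them under this rule.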
For the bot detection analysis, we use the Light model, as it displayed good results in Section 3.3 and is scalable. First, we study the distribution of the bot score in each of the cases. The distributions are displayed in Figure 3. The decision threshold corresponds to the one computed in Section 3.3. All the distributions are positively skewed, indicating a larger presence of humans than bots. Moreover, except for the TRUMP distribution, a clear tail is observed.

Then, we study whether the distributions are similar to each other. We run the Anderson-Darling statistical test to analyze whether the samples of bot scores come from the same distribution. After running the test for all the pairs of distributions, we reject the null hypothesis at a 1% significance level. We conclude that there is statistically significant evidence that the samples for each case do not come from the same distribution.

Figure 3: Bot score distribution for each of the cases studied.

Figure 4: Proportion of bot and human accounts that interacted in each case.

Besides, we classify each account as a bot or human using the decision threshold computed in Section 3.3. Figure 4 displays the proportion of bot and human accounts identified in each case. We notice that the OUTBREAK and WUHAN cases have the smallest amount of bots, with only around 7% bot-operated accounts. In COVID and LOCKDOWN, about 10% and 12% of the accounts are bots. The TRUMP case has the maximum proportion of bots, with more than 18%.

Figure 5: Proportion of tweets which were produced by bots and humans in each of the cases studied.

Then, we compute the number of tweets produced by bots and humans in each case. Figure 5 displays a comparative bar chart with the proportion of tweets created by bots and humans in each topic. We observe that in all the cases except TRUMP, the proportions of each type of account and of the tweets made by those accounts are analogous, not differing by more than 3%. This fact indicates that bots and humans as a group present the same rate of activity in these cases.
By contrast, in the TRUMP case, we see that bots are more active than humans. The bots, only 18.26% of the accounts, produce 55.73% of the total tweets in this case.
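The pairwise distribution comparison used above (a k-sample Anderson-Darling test with rejection at the 1% level) can be sketched with SciPy. The bot-score arrays below are synthetic stand-ins for the real per-topic samples:

```python
import numpy as np
from scipy.stats import anderson_ksamp

rng = np.random.default_rng(0)

# Synthetic stand-ins for two topics' bot-score samples (real scores lie in [0, 1]).
scores_topic_a = rng.beta(2.0, 8.0, size=500)  # strongly positively skewed
scores_topic_b = rng.beta(4.0, 4.0, size=500)  # more mass at higher bot scores

# k-sample Anderson-Darling test; note that SciPy caps the returned
# significance level to the interval [0.001, 0.25].
result = anderson_ksamp([scores_topic_a, scores_topic_b])
reject_at_1pct = result.significance_level < 0.01
print(result.statistic, result.significance_level, reject_at_1pct)
```

In the study, the same test is run for every pair of topic distributions, rejecting the null hypothesis of a common distribution whenever the significance level falls below 1%.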
In order to understand whether bots increase exposure to negative and inflammatory content in online social systems, we analyze the differences in tweet content between bots and humans in each case. Sentiment analysis allows us to monitor social media and extract an overview of the opinion of Twitter users. First, we implement sentiment analysis in each of the situations using VADER, to learn about the reactions of users in each situation studied. Then, the sentiment analysis is extended for the LOCKDOWN and TRUMP cases, using only the hashtags in the tweets to predict tweet sentiment. Finally, we examine the most common hashtags for bots and humans and discuss the differences between the two groups.

4.2.1 Sentiment Analysis using VADER
We use VADER [11] to implement the sentiment analysis for all the cases. VADER is a sentiment model specifically designed to analyze microblog-like content such as tweets. To predict the sentiment, VADER uses a list of lexical features with their corresponding gold-standard sentiment intensities, combined using a set of five grammatical rules. According to the study in [68], which benchmarked more than 20 techniques on 18 datasets, VADER is one of the best sentiment analysis methods for social media messages. Apart from its performance, we chose VADER because of its scalability and simple utilization; a VADER implementation is available in the NLTK library [32]. Besides, it needs little preprocessing compared to other methods. We apply the following preprocessing steps to the tweet content before using the VADER sentiment analyzer:

1. Remove extra white spaces.
2. Remove links and/or URLs.
3. Remove usernames.
4. Remove the RT symbol.
5. Remove HTML elements.
6. Remove

We then classify each tweet according to its compound score:

• Positive: compound score ≥ 0.05.
• Neutral: -0.05 < compound score < 0.05.
• Negative: compound score ≤ -0.05.

Figure 6 displays the proportions of tweets for each case after applying the sentiment thresholds above.

Figure 6: Sentiment of the tweets for each of the cases studied for human and bot-operated accounts.

We observe that the OUTBREAK case shows similar proportions for bots and humans. There is a greater presence of positive and neutral tweets (around 80%), with negative tweets being the minority.

Regarding WUHAN, we also notice similar proportions between humans and bots. In contrast to OUTBREAK, there is a larger proportion of negative and neutral tweets, with positive tweets being the minority at only around 18% for both bots and humans. It is worth mentioning that even though WUHAN and OUTBREAK are highly related and cover the same period, they show inverse behaviors.

Regarding COVID, we notice that both humans and bots produced similar proportions of negative, neutral, and positive tweets.
The former fact might indicate a division of users' opinion on the measure of quarantining New York. Unlike the previous cases, humans and bots show different proportions in the LOCKDOWN and TRUMP cases.

In LOCKDOWN, bots show similar amounts of positive, neutral, and negative tweets. However, humans mainly display a negative tendency (50.74% of the total tweets), while positive and neutral tweets split the remaining half in a balanced way. This value might indicate public disagreement with the first steps out of the lockdown proposed by the UK Prime Minister.

In the TRUMP case, we observe a more evident difference between the sentiment proportions of tweets produced by bots and humans. Humans present a balance between the three classes with a slight dominance of negative tweets (42% negative, 27% neutral, 31% positive). We interpret this result as mild dissent of users with President Trump's political performance during that period. On the other hand, negative-sentiment tweets constitute the majority for bots, with almost 80% of the tweets. These values represent a drastic difference, showing that tweets generated by bots have a predominantly negative attitude.

So far, we have used thresholds and discrete labels to measure the sentiment. However, one setback of this approach is the inability to account for intensities. For instance, we cannot differentiate between an extremely and a slightly negative tweet, since both are considered negative. To overcome this limitation and make a more extensive study, we complemented the previous analysis by studying the sentiment with a continuous metric, i.e., the compound score. This analysis allows us to comment also on the intensity of the tweet content.

Figure 7 displays the distributions of compound scores for bot and human accounts in each case. We observe that for OUTBREAK, WUHAN, and COVID, the locations of the peaks of the distributions for humans and bots are similar.
Moreover, most of the scores are around 0 in these cases, with the samples not presenting extreme scores. In the human distribution for the LOCKDOWN case, we observe that the negative tweets display a more extreme score (peak between -0.6 and -0.8) than the positive ones (less than 0.5). This indicates that human users were more drastic when referring negatively to the lockdown than when referring positively. Besides, this is the only distribution where we can notice two peaks: one in the neutral interval and one in the negative scores. Regarding bots in the LOCKDOWN case, we observe that the positive tweets are close to the central scores, while negative scores appear along the whole spectrum, from more neutral to more extreme. Concerning the TRUMP case, the bot distribution displays a single peak, showing that most tweets have a slightly negative sentiment. In the case of humans, all the compound scores are located in the center of the distribution, which implies that positive and negative tweets do not show extreme positions.

Furthermore, we run an Anderson-Darling test to check whether the samples of compound scores of humans and bots present the same distribution in each case. After running the test for all the pairs of distributions, we reject the null hypothesis at a 1% significance level. Therefore, we conclude that there is statistically significant evidence that the samples do not come from the same distribution.

The experiments in this subsection have some limitations. First, even though VADER presents the previously described advantages, it is not attuned to tweets about politics, which can reduce its performance on occasions. Besides, using hashtags to extract tweets on the same topic might be sensitive to spam: Twitter users can use hashtags to gain popularity or attention even when they are not related to the tweet content.
Moreover, our hashtag-based extraction method can retrieve some tweets which are not fully related to the topic we are studying. That being said, these limitations are not thought to be significant enough to prevent us from grasping valuable insights about the overall opinion displayed by the Twitter community on specific topics and analyzing differences in sentiment between humans and bots.

Figure 7: Distribution of sentiment compound score for each case regarding human and bot accounts.
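The preprocessing steps and compound-score thresholds of this subsection can be sketched as follows. The regular expressions are illustrative; the compound scores themselves come from NLTK's `SentimentIntensityAnalyzer`, which is not re-implemented here:

```python
import re

def preprocess(tweet: str) -> str:
    """Apply the cleaning steps used before VADER scoring."""
    text = re.sub(r"https?://\S+", " ", tweet)  # remove links/URLs
    text = re.sub(r"@\w+", " ", text)           # remove usernames
    text = re.sub(r"\bRT\b", " ", text)         # remove the RT symbol
    text = re.sub(r"<[^>]+>", " ", text)        # remove HTML elements
    return re.sub(r"\s+", " ", text).strip()    # collapse extra white spaces

def label(compound: float) -> str:
    """Map a VADER compound score in [-1, 1] to a discrete sentiment class."""
    if compound >= 0.05:
        return "positive"
    if compound <= -0.05:
        return "negative"
    return "neutral"

print(preprocess("RT @who Stay safe! https://t.co/abc"))  # "Stay safe!"
print(label(0.42))                                        # "positive"
```

In practice, `label` would be applied to `SentimentIntensityAnalyzer().polarity_scores(preprocess(tweet))["compound"]`.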
We evaluate the sentiment through the hashtags in the tweets. By doing so, we expect to overcome some of the limitations exposed in the previous section and make a more extensive analysis. After first manually labeling all the hashtags in the tweets as positive, negative, or neutral, we follow this approach to obtain the sentiment of a tweet:

• If a tweet contains at least one positive hashtag, it is labeled as positive.
• If a tweet contains at least one negative hashtag, it is labeled as negative.
• If a tweet contains neither a positive nor a negative hashtag, it is labeled as neutral.
• If a tweet contains at least one positive hashtag and one negative hashtag, it is labeled as inconclusive.

It is worth mentioning that all the tweets evaluated contain at least one hashtag because of the extraction method. Moreover, as the results will convey, inconclusive tweets are a minority, since a user typically refers to either negative or positive hashtags regarding a topic, not both.

In particular, we only evaluated the topics LOCKDOWN and TRUMP, since they show a higher polarity. We expect to gain insights into the opinion of users regarding Trump's political performance and the lockdown measures. The hashtags were manually labeled following specific guidelines for each case.

We followed the rules below to label the hashtags in the LOCKDOWN tweets:

• +1 (positive) is assigned to all hashtags which display a favourable attitude towards the lockdown and individual protection measures.
• -1 (negative) is assigned to those hashtags against the lockdown and individual protection measures.
• The rest of the hashtags are labelled as 0 (neutral).

We followed the guidelines below to label the hashtags in the TRUMP tweets:

• +1 (positive) is assigned to those hashtags in favour of Trump or his campaign, the GOP, or conspiracy theories which support the figure of Trump. Hashtags containing pro-Trump slogans are also labeled as +1.
• -1 (negative) is assigned to those hashtags which show an offensive attitude towards Trump, including nicknames. -1 is also given to those hashtags which are against the GOP, constitute sarcastic slogans, or are in favour of the Democratic Party.
• 0 is given to the rest of the hashtags.

Using the previous instructions, in the LOCKDOWN case we labeled 221 negative and 241 positive hashtags out of the 14376 hashtags in the LOCKDOWN tweets. In the TRUMP case, we obtained 938 negative and 367 positive hashtags out of 9678 total hashtags. Moreover, there were less than 1% of inconclusive tweets in both cases.

The results using the hashtag-based method are shown in Figure 8. We observe a predominant proportion of neutral tweets in all cases. This result matches the nature of hashtags: they usually label tweets within a topic, expressing an opinion less frequently. However, when they do express an opinion, they give us evidence of the position of the user, which allows us to gain more accurate insights into the opinions on the topics studied. In the LOCKDOWN case, we observe twice as many tweets with positive sentiment (12.66%) as tweets with negative sentiment (6.35%). From these results, we could say that more people agree with the lockdown measures than disagree. We observe the same tendency for the bots in the LOCKDOWN case: the proportion of positive tweets is bigger than the negative one. In both cases, neutral tweets constitute the majority, with 81% for humans and 71.68% for bots. For the TRUMP case, humans and bots display a bigger proportion of negative tweets than positive ones. However, the proportions differ significantly between the two groups. For bots, the difference between positive and negative is 3%, while neutral tweets constitute almost 85% of the tweets.

Figure 8: Sentiment of the tweets using hashtags for humans and bots for the LOCKDOWN and TRUMP cases.
Concerning humans, we observe that less than 50% of the tweets are neutral. We notice a bigger proportion of negative-sentiment tweets than positive: 31% against 22%. This shows that public opinion had a more negative attitude towards Donald Trump in that period.
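The tweet-level rules above (positive, negative, neutral, or inconclusive depending on the polarities of manually labelled hashtags) can be sketched as follows. The polarity dictionary here is hypothetical; the study labelled the real LOCKDOWN and TRUMP hashtag sets by hand:

```python
# Hypothetical hashtag polarities (+1 positive, -1 negative); unlisted
# hashtags are treated as neutral (0), as in the labeling guidelines.
HASHTAG_POLARITY = {"stayhome": +1, "staysafe": +1, "endthelockdown": -1}

def tweet_sentiment(hashtags):
    """Label a tweet from the polarities of its hashtags."""
    polarities = {HASHTAG_POLARITY.get(h.lower().lstrip("#"), 0) for h in hashtags}
    if +1 in polarities and -1 in polarities:
        return "inconclusive"
    if +1 in polarities:
        return "positive"
    if -1 in polarities:
        return "negative"
    return "neutral"

print(tweet_sentiment(["#StayHome", "#Covid19"]))  # "positive"
```

Since every tweet was extracted through its hashtags, each tweet has at least one hashtag to evaluate, and the inconclusive class stays small in practice.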
In this section, we explore the differences in the discourse between bots and humans regarding hashtags. This analysis aims to see whether bots and humans tweet about different things even in the same context. Significant differences in the hashtags between bots and humans would imply that their conversations differ. To implement this analysis, we plot, for each case, the 20 most frequent hashtags used by humans and bots.

Figure 9: Most frequent hashtags for the OUTBREAK, WUHAN and COVID cases.

Figure 9 displays the most frequent hashtags used by humans and bots for the OUTBREAK, WUHAN, and COVID cases. We observe in all three cases that humans and bots use similar hashtags, indicating a very homogeneous discourse. We list below the few differences that we can spot between the hashtags in each case:

• In contrast to bots, or are among the most common hashtags used by humans in OUTBREAK. The former might be because human users are sharing pieces of information based on infographics. The latter could mean that human users find similarities between the Ebola outbreak in Europe and the U.S. in 2014 and the Covid-19 situation.

• In the WUHAN case, bots utilize the term to refer to COVID-19, in contrast to humans.

• In the COVID case, we can see support by human users for the U.S. Navy with the hashtag . This hashtag probably refers to when the U.S. Navy sent a hospital ship to help the area of New York [19]. Conversely to bots, we observe that humans use . The PMCARES Fund was created in India on 27th March to fight against Covid-19 and analogous pandemic situations in the future [1]. On the other hand, bots in COVID share the message as a prevention measure for Covid.

Figure 10 displays the most frequent hashtags used by humans and bots for the LOCKDOWN and TRUMP cases. We observe in LOCKDOWN that the most frequent hashtags are equal for bots and humans. In general terms, we can see hashtags referring to U.K., India, or South Africa events in both cases. For instance, refers to the U.K.
lockdown, and hashtags such as are related to India. In India, Mother's Day is the second Sunday of May, which fell on 10th May in 2020 [17]. Otherwise, regards South Africa, since 10th May was the 44th day of lockdown in South Africa [4].

Figure 10: Most frequent hashtags for the LOCKDOWN and TRUMP cases.

However, one difference between bots and humans in the discourse is that humans also focused on the lockdown in Ireland as . Besides, humans use the hashtag in their discourse, probably referring to the pressure on U.K. hospitals due to the high occupancy of Intensive Care Units [6]. In contrast to humans, we also notice that bots use the hashtag , referring to the violence suffered by women in the Indian state of Tamil Nadu.

The TRUMP case is where we observe the biggest difference between the discourse of humans and bots. One of the main differences we spot is the pro-Trump hashtag . We also notice some other pro-Trump hashtags such as , , . Besides, the Tea Party movement ( ) and Top Conservatives on Twitter ( ) should favor President Trump. It seems humans show their support for Trump more evidently than bot-operated accounts. One of the most recurring topics for humans is Trump's budget proposal of 10th February. The proposal advocated an increase in defense spending, and cuts and restrictions in foreign aid and social welfare programs [20]. Humans refer directly to the Impeachment with hashtags against Trump, such as and . Besides, humans mention the hacking attack on Equifax, which affected the data of 145 million Americans [9]. On the other hand, we observe that bots recurrently use the hashtags , , and to refer to the COVID-19 pandemic. Besides, we notice that bots also speak about the Impeachment, but they refer to it differently: they do not use hashtags that display opposition to Trump as humans do. They utilize neutral hashtags such as , or hashtags containing the names of people who participated in the process, such as retired U.S.
Army Lieutenant Colonel Alexander Vindman ( ), and the State of the Union speech ( ). Moreover, we also perceive that some bots aim to spread news, such as the crash of a plane from Pegasus Airlines ( ) [18] or the avalanche in Bahçesaray, Turkey ( ) [8].

To sum up, we observe that in the OUTBREAK, WUHAN, and COVID cases there exist few dissimilarities between the discourse of bots and humans regarding the hashtag analysis. However, these differences increase in LOCKDOWN and TRUMP, the latter being the case where humans and bots differ most in their discourse.

In this work, we produce a comparison between supervised bot detection methods using Data Selection and a case study related to the Covid-19 pandemic. The comparative study aims to find a consistent model with the best balance between cross-validation and cross-domain generalization. In the comparison, we compared the method in [75] with [65]. We followed a pipeline similar to [75]; however, we extended the study using an extra test dataset, the metadata currently available in the Twitter API, and several classification algorithms. Besides, we applied the data selection technique to [65]. The experiments showed that combining [65] with data selection produces excellent results, not only outperforming the model from [75] in certain situations but also when compared to Botometer version 3. The model implemented proves to be more effective than the other two when detecting bots that convey a coordinated behavior. Alternatively, the model with the approach from [75], after trying different classification algorithms, also produces competitive results. We use this model in the case study because of its performance and scalability.

In our case study, we set forth to investigate to what extent automated bot accounts were active on Twitter during the global health crisis due to the Covid-19 pandemic.
Prior works demonstrated how bots acted massively in different contexts, such as election campaigns or the Brexit crisis, and how they have been used in malicious manners to spread misinformation and manipulate public debate. This behaviour would be particularly dangerous in the context of a global health outbreak, when public discourse moves more and more online due to social distancing measures.

Our findings paint a picture where, while automated accounts are numerous and active when discussing some controversial issues, such as the lockdown measures in the UK or the beginning of the pandemic in Wuhan, they usually do not seem to increase exposure to negative and inflammatory content in online social systems. Despite this, when the discourse switches to the management of the pandemic by President Trump, bots become much more active in spreading discontent related to his policy decisions as a consequence of the underestimation of the outbreak. In this case, sentiment-related values display a drastic difference, showing that tweets generated by bots have a predominantly negative attitude.

By evaluating the sentiment through the hashtags in the tweets, we expected to gain a deeper understanding of the opinion of bots and humans regarding Trump's political performance and the lockdown measures. Concerning humans, we could say that more people agree with the need for lockdown measures than disagree. Consistently, Trump's policy of underestimating the health emergency was heavily criticized by human users. However, in these cases we cannot definitely conclude that the bots are responsible for exposure to negative content related to these two topics.

Furthermore, this result seems consistent with the hashtag analysis, which explores the differences in the discourse between bots and humans. Significant differences in the hashtags shared by humans and bots would imply that the conversations between them differ.
While in the OUTBREAK, WUHAN, and COVID cases there exist few dissimilarities between the discourse of bots and humans, these differences increase in the LOCKDOWN and TRUMP cases, the latter being where humans and bots differ most in their discourse. Nevertheless, in the TRUMP case it seems humans show their support for Trump more evidently than automated accounts, disproving, within the limits of this case study, the hypothesis of any conspiratorial attitude pushed by bots.
References

[1] About PM CARES Fund for emergency or distress situations.
[2] Alexander Vindman's lawyer calls Trump's comments 'obviously false' - BBC News. (Accessed on 02/01/2021).
[3] Boris Johnson's lockdown release condemned as divisive, confusing and vague — World news — The Guardian. (Accessed on 09/21/2020).
[4] Coronavirus lockdown around the world in pictures - BBC News.
[5] Coronavirus: Trump 'considering quarantining New York' - BBC News. (Accessed on 12/14/2020).
[6] Covid-19 press conference slides 2020-05-15. https://assets.publishing.service.gov.uk. (Accessed on 02/01/2021).
[7] DocNow. Twarc. 2020. https://github.com/DocNow/twarc.
[8] Dozens of rescue workers killed in second Turkish avalanche — World news — The Guardian. (Accessed on 02/01/2021).
[9] Equifax breach, Trump, coronavirus, Oscars, Bronx: Monday's news. https://eu.usatoday.com/story/news/2020/02/10/equifax-breach-trump-coronavirus-oscars-bronx-mondays-news/4714180002/. (Accessed on 02/01/2021).
[10] Get caught up: here's the latest on the outbreak. https://edition.cnn.com/asia/live-news/coronavirus-outbreak-hnk-intl-01-26-20/h_11406735d740cee8ef48f77cd0e4c057. (Accessed on 12/14/2020).
[11] GitHub - cjhutto/vaderSentiment: VADER sentiment analysis. VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media, and works well on texts from other domains. https://github.com/cjhutto/vaderSentiment.
[12] The Great Lockdown: Worst economic downturn since the Great Depression – IMF Blog. https://blogs.imf.org/2020/04/14/the-great-lockdown-worst-economic-downturn-since-the-great-depression/. (Accessed on 10/18/2020).
[13] January 25 coronavirus news. https://edition.cnn.com/asia/live-news/coronavirus-outbreak-hnk-intl-01-25-20/index.html. (Accessed on 12/14/2020).
[14] The lost month: Trump says he took 'strong action' in February to stop coronavirus. Here's the full picture.
https://edition.cnn.com/interactive/2020/04/politics/trump-covid-response-annotation/. (Accessed on 09/22/2020).
[15] March 28 coronavirus news. https://edition.cnn.com/world/live-news/coronavirus-outbreak-03-28-20-intl-hnk/index.html. (Accessed on 12/14/2020).
[16] March 29 coronavirus news. https://edition.cnn.com/world/live-news/coronavirus-outbreak-03-29-20-intl-hnk/index.html. (Accessed on 12/14/2020).
[17] Mother's Day 2020: When is Mother's Day in 2020? — Lifestyle News, The Indian Express. https://indianexpress.com/article/lifestyle/life-style/mothers-day-2020-when-is-mothers-day-in-2020-6393065/. (Accessed on 02/01/2021).
[18] Pegasus Airlines plane skids off runway, crashes in Turkey - Business Insider. (Accessed on 02/01/2021).
[19] President Trump to see off Navy hospital ship USNS Comfort headed for New York - watch live stream - CBS News. (Accessed on 01/31/2021).
[20] Trump submits $ . (Accessed on 02/01/2021).
[21] Twitter API documentation — Docs — Twitter Developer. https://developer.twitter.com/en/docs/twitter-api. (Accessed on 10/18/2020).
[22] The United States badly bungled coronavirus testing—but things may soon improve — Science — AAAS. (Accessed on 12/14/2020).
[23] User object docs — Twitter Developer. https://developer.twitter.com/en/docs/twitter-api/v1/data-dictionary/object-model/user.
[24] Warren calls on Trump admin to explain process for bringing back Americans infected by coronavirus - Politico. (Accessed on 09/22/2020).
[25] Why Mitt Romney voted to convict Trump - The Atlantic. (Accessed on 02/01/2021).
[26] Norah Abokhodair, Daisy Yoo, and David W. McDonald. Dissecting a social botnet: Growth, content and influence in Twitter. In Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing, pages 839–851, 2015.
[27] Lorenzo Alvisi, Allen Clement, Alessandro Epasto, Silvio Lattanzi, and Alessandro Panconesi. SoK: The evolution of sybil defense via social networks. In , pages 382–396.
IEEE, 2013.[28] Marzia Antenore and Elisabetta Trinca. Who bots there, friend or foe?social bots and digital platforms. In
Technological and digital risk. Researchissues . Peter Lang, 2020.[29] Juan M. Banda, Ramya Tekumalla, Guanyu Wang, Jingyuan Yu, Tuo Liu,Yuning Ding, and Gerardo Chowell. A twitter dataset of 150+ milliontweets related to covid-19 for open research. [Online].Published by Zenodo.Available from: https://github.com/thepanacealab/covid19_twitter ,2020. 3130] Marco T Bastos and Dan Mercea. The brexit botnet and user-generatedhyperpartisan news.
Social Science Computer Review , 37(1):38–54, 2019.[31] Alessandro Bessi and Emilio Ferrara. Social bots distort the 2016 uspresidential election online discussion.
First Monday , 21(11-7), 2016.[32] Steven Bird, Ewan Klein, and Edward Loper.
Natural language processingwith Python: analyzing text with the natural language toolkit . ” O’ReillyMedia, Inc.”, 2009.[33] Emily Chen, Kristina Lerman, and Emilio Ferrara. Tracking social mediadiscourse about the covid-19 pandemic: Development of a public coronavirustwitter data set.
JMIR Public Health and Surveillance , 6(2):e19273, 2020.[34] Stefano Cresci. A decade of social bot detection.
Communications of theACM , 63(10):72–83, 2020.[35] Stefano Cresci, Roberto Di Pietro, Marinella Petrocchi, Angelo Spognardi,and Maurizio Tesconi. The paradigm-shift of social spambots: Evidence,theories, and tools for the arms race. In
Proceedings of the 26th internationalconference on world wide web companion , pages 963–972, 2017.[36] Stefano Cresci, Roberto Di Pietro, Marinella Petrocchi, Angelo Spognardi,and Maurizio Tesconi. Social fingerprinting: detection of spambot groupsthrough dna-inspired behavioral modeling.
IEEE Transactions on Depend-able and Secure Computing , 15(4):561–576, 2017.[37] Stefano Cresci, Fabrizio Lillo, Daniele Regoli, Serena Tardelli, and MaurizioTesconi. $ fake: Evidence of spam and bot activity in stock microblogs ontwitter. In Twelfth international AAAI conference on web and social media ,2018.[38] Stefano Cresci, Fabrizio Lillo, Daniele Regoli, Serena Tardelli, and MaurizioTesconi. Cashtag piggybacking: Uncovering spam and bot activity in stockmicroblogs on twitter.
ACM Transactions on the Web (TWEB) , 13(2):1–27,2019.[39] Twitter Developer Documentation. https://developer.twitter.com/en .[40] Chad Edwards, Autumn Edwards, Patric R Spence, and Ashleigh K Shelton.Is that a bot running the social media feed? testing the differences inperceptions of communication quality for a human agent and a bot agenton twitter.
Computers in Human Behavior , 33:372–376, 2014.[41] Cigdem Eroglu Erdem, Elif Bozkurt, Engin Erzin, and A Tanju Erdem.Ransac-based training data selection for emotion recognition from sponta-neous speech. In
Proceedings of the 3rd international workshop on Affectiveinteraction in natural environments , pages 9–14, 2010.3242] Emilio Ferrara. Disinformation and social bot operations in the run upto the 2017 french presidential election. arXiv preprint arXiv:1707.00086 ,2017.[43] Emilio Ferrara. What types of covid-19 conspiracies are populated by twitterbots?
First Monday , 2020.[44] Emilio Ferrara, Onur Varol, Clayton Davis, Filippo Menczer, and AlessandroFlammini. The rise of social bots.
Communications of the ACM , 59(7):96–104, 2016.[45] Michelle Forelle, Phil Howard, Andr´es Monroy-Hern´andez, and Saiph Savage.Political bots and the manipulation of public opinion in venezuela. arXivpreprint arXiv:1507.07109 , 2015.[46] Junling Gao, Pinpin Zheng, Yingnan Jia, Hao Chen, Yimeng Mao, SuhongChen, Yi Wang, Hua Fu, and Junming Dai. Mental health problems andsocial media exposure during covid-19 outbreak.
Plos one , 15(4):e0231924,2020.[47] Robert W Gehl and Maria Bakardjieva.
Socialbots and their friends: Digitalmedia and the automation of sociality . Taylor & Francis, 2016.[48] Maria Giatsoglou, Despoina Chatzakou, Neil Shah, Christos Faloutsos, andAthena Vakali. Retweeting activity on twitter: Signs of deception. In
Pacific-Asia Conference on Knowledge Discovery and Data Mining , pages122–134. Springer, 2015.[49] Zafar Gilani, Reza Farahbakhsh, Gareth Tyson, Liang Wang, and JonCrowcroft. Of bots and humans (on twitter). In
Proceedings of the 2017IEEE/ACM International Conference on Advances in Social Networks Anal-ysis and Mining 2017 , pages 349–354, 2017.[50] CHE Gilbert and Erric Hutto. Vader: A parsimonious rule-based modelfor sentiment analysis of social media text. In
Eighth International Confer-ence on Weblogs and Social Media (ICWSM-14). Available at (20/04/16)http://comp. social. gatech. edu/papers/icwsm14. vader. hutto. pdf , vol-ume 81, page 82, 2014.[51] Philip N Howard, Bence Kollanyi, and Samuel Woolley. Bots and automationover twitter during the us election.
Computational Propaganda Project:Working Paper Series , 2016.[52] Meng Jiang, Peng Cui, Alex Beutel, Christos Faloutsos, and ShiqiangYang. Catching synchronized behaviors in large networks: A graph miningapproach.
ACM Transactions on Knowledge Discovery from Data (TKDD) ,10(4):1–27, 2016. 3353] Bennett Kleinberg, Isabelle van der Vegt, and Maximilian Mozes. Mea-suring emotions in the covid-19 real world worry dataset. arXiv preprintarXiv:2004.04225 , 2020.[54] J¨urgen Knauth. Language-agnostic twitter-bot detection. In
Proceedingsof the International Conference on Recent Advances in Natural LanguageProcessing (RANLP 2019) , pages 550–558, 2019.[55] Dijana Kosmajac and Vlado Keselj. Twitter bot detection using diversitymeasures. In
Proceedings of the 3rd International Conference on NaturalLanguage and Speech Processing , pages 1–8, 2019.[56] Sneha Kudugunta and Emilio Ferrara. Deep neural networks for bot detec-tion.
Information Sciences , 467:312–322, 2018.[57] Rabindra Lamsal. Coronavirus (covid-19) tweets dataset. [Online].Publisedby IEEE Dataport. Available from: https://ieee-dataport.org/open-access/coronavirus-covid-19-tweets-dataset , 2020.[58] Kyumin Lee, Brian David Eoff, and James Caverlee. Seven months withthe devils: A long-term study of content polluters on twitter. In
Fifthinternational AAAI conference on weblogs and social media , 2011.[59] Octavio Loyola-Gonz´alez, Ra´ul Monroy, Jorge Rodr´ıguez, Armando L´opez-Cuevas, and Javier Israel Mata-S´anchez. Contrast pattern-based classifica-tion for bot detection on twitter.
IEEE Access , 7:45800–45817, 2019.[60] Luca Luceri, Ashok Deb, Adam Badawy, and Emilio Ferrara. Red bots do itbetter: Comparative analysis of social bot partisan behavior. In
CompanionProceedings of the 2019 World Wide Web Conference , pages 1007–1012,2019.[61] Jonas Lundberg, Jonas Nordqvist, and Mikko Laitinen. Towards a languageindependent twitter bot detector. In
DHN , pages 308–319, 2019.[62] Michele Mazza, Stefano Cresci, Marco Avvenuti, Walter Quattrociocchi,and Maurizio Tesconi. Rtbust: Exploiting temporal patterns for botnetdetection on twitter. In
Proceedings of the 10th ACM Conference on WebScience , pages 183–192, 2019.[63] Eni Mustafaraj and P Takis Metaxas. From obscurity to prominence inminutes: Political speech and real-time search. 2010.[64] Mehwish Nasim, Andrew Nguyen, Nick Lothian, Robert Cope, and LewisMitchell. Real-time detection of content polluters in partially observabletwitter networks. In
Companion Proceedings of the The Web Conference2018 , pages 1331–1339, 2018. 3465] Nivranshu Pasricha and Conor Hayes. Detecting bot behaviour in socialmedia using digital dna compression. In . AICS (Artificial Intelligenceand Cognitive Science) 2019, 2019.[66] Jacob Ratkiewicz, Michael D Conover, Mark Meiss, Bruno Gon¸calves,Alessandro Flammini, and Filippo Menczer Menczer. Detecting and trackingpolitical abuse in social media. In
Fifth international AAAI conference onweblogs and social media . Citeseer, 2011.[67] Adrian Rauchfleisch and Jonas Kaiser. The false positive problem of au-tomatic bot detection in social science research.
Berkman Klein CenterResearch Publication , (2020-3), 2020.[68] Filipe N Ribeiro, Matheus Ara´ujo, Pollyanna Gon¸calves, Marcos Andr´eGon¸calves, and Fabr´ıcio Benevenuto. Sentibench-a benchmark comparisonof state-of-the-practice sentiment analysis methods.
EPJ Data Science ,5(1):1–29, 2016.[69] Marian-Andrei Rizoiu, Timothy Graham, Rui Zhang, Yifei Zhang, RobertAckland, and Lexing Xie. arXiv preprintarXiv:1802.09808 , 2018.[70] Bj¨orn Ross, Laura Pilz, Benjamin Cabrera, Florian Brachten, GermanNeubaum, and Stefan Stieglitz. Are social bots a real threat? an agent-basedmodel of the spiral of silence to analyse the impact of manipulative actors insocial networks.
European Journal of Information Systems , 28(4):394–412,2019.[71] Mohsen Sayyadiharikandeh, Onur Varol, Kai-Cheng Yang, Alessandro Flam-mini, and Filippo Menczer. Detection of novel social bots by ensembles ofspecialized classifiers. arXiv , pages arXiv–2006, 2020.[72] Onur Varol, Emilio Ferrara, Clayton A Davis, Filippo Menczer, and Alessan-dro Flammini. Online human-bot interactions: Detection, estimation, andcharacterization. In
Eleventh international AAAI conference on web andsocial media , 2017.[73] Yi Wu, Rong Zhang, and Alexander Rudnicky. Data selection for speechrecognition. In , pages 562–565. IEEE, 2007.[74] Kai-Cheng Yang, Onur Varol, Clayton A Davis, Emilio Ferrara, AlessandroFlammini, and Filippo Menczer. Arming the public with artificial intelligenceto counter social bots.
Human Behavior and Emerging Technologies , 1(1):48–61, 2019. 3575] Kai-Cheng Yang, Onur Varol, Pik-Mai Hui, and Filippo Menczer. Scalableand generalizable social bot detection through data selection. In
Proceedingsof the AAAI Conference on Artificial Intelligence , volume 34, pages 1096–1103, 2020.[76] Zixing Zhang, Florian Eyben, Jun Deng, and Bj¨orn Schuller. An agreementand sparseness-based learning instance selection and its application to sub-jective speech phenomena. In