Posting Bot Detection on Blockchain-based Social Media Platform using Machine Learning Techniques
Taehyun Kim, Hyomin Shin, Hyung Ju Hwang,∗ Seungwon Jeong∗
Pohang University of Science and Technology; University of Bristol
{taehyun3401, zhainl, hjhwang}@postech.ac.kr, [email protected]

Abstract
Steemit is a blockchain-based social media platform, where authors can get author rewards in the form of cryptocurrencies called STEEM and SBD (Steem Blockchain Dollars) if their posts are upvoted. Interestingly, curators (or voters) can also get rewards by voting on others' posts, which is called a curation reward. A reward is proportional to a curator's STEEM stakes. Throughout this process, Steemit hopes "good" content will be automatically discovered by users in a decentralized way, which is known as Proof-of-Brain (PoB). However, there are many bot accounts programmed to post automatically and get rewards, which discourages real human users from creating good content. We call this type of bot a posting bot. While there are many papers that studied bots on traditional centralized social media platforms such as Facebook and Twitter, we are the first to study posting bots on a blockchain-based social media platform. Compared with bot detection on the usual social media platforms, the features we created have the advantage that posting bots can be detected without limiting the number or length of posts. We extract the features of posts by clustering distances between blog data or replies. These features are obtained from the Minimum Average Cluster from Clustering Distance between Frequent words and Articles (MAC-CDFA), which was not used in any previous social media research. Based on the enriched features, we enhanced the quality of the classification tasks. Comparing F1-scores, the features we created outperformed the features used for bot detection on Facebook and Twitter.

Introduction
Despite the interest in blockchain technology, the usage of the so-called decentralized application (DApp) is still limited. Except for transferring and trading cryptocurrencies, one of the most widely used applications is Steemit (https://steemit.com), a blockchain-based social media platform. Based on a DApp ranking site, Steemit had ranked first among all DApps for a long time, and it still ranks sixth; most DApps with high ranks are based on the Steem blockchain (Steemit 2017), on which Steemit also runs.

∗ Corresponding authors. This paper will appear in the proceedings of ICWSM 2021.

On Steemit, authors get author rewards in the form of cryptocurrencies called STEEM and SBD (Steem Blockchain Dollars) if their posts are upvoted. Interestingly, curators (or voters) also get rewards by voting on others' posts, which is called a curation reward. A user is an author if she writes a post, and is a curator if she votes on a post (including her own posts). Rewards are proportional to a curator's staked amount of STEEM, which is called STEEM POWER. That is, an upvote from a user with more STEEM POWER has a higher value. Each vote consumes voting power, which is regenerated as time goes by. There is also a downvote, which decreases the reward of a post and is intended to prevent spam and any malicious content. Throughout this process, Steemit hopes "good" content will be automatically discovered by users in a decentralized way, which is called Proof-of-Brain (PoB).

However, as on other traditional social media platforms such as Facebook and Twitter, there are many bot accounts that post automatically. We call this type of bot a posting bot. Detection of posting bots may be more critical on Steemit than on other platforms, because posting bots on Steemit also get rewards, which discourages real human users from creating good content. Due to downvoting, bots that spam frequently cannot survive in terms of rewards. Therefore, posting bots have evolved to write more meaningful posts, hence appearing like human accounts. There are many papers that studied bots on traditional social media platforms. In particular, some studies detected posting bots on Twitter.
Twitter is a microblogging site on which users post messages called Tweets. A Tweet has a 140 (or 280 since November 2017) character limit. Thus, a text in a tweet is short and relatively easy to analyze compared with other social media platforms such as Facebook and Steemit. On Steemit, there is no restriction on the length of a post, but it is limited by the block size, which is currently 64KB. This is quite enough for most social media posts. (Converting STEEM to STEEM POWER is instant, but the reverse takes 13 weeks. Any media files, e.g., pictures and videos, are uploaded to a traditional cloud service, and only links to the media are included in the post.) Another complication of detecting bots on Steemit is its high level of anonymity, which comes from the decentralized nature of blockchain. Due to its financial rewards, relatively long texts, and high level of anonymity, it is both important and challenging to detect posting bots on Steemit.

To the best of our knowledge, this study is the first to investigate posting bots on a blockchain-based social media platform. Compared with bot detection on traditional social media platforms, the features we created have the advantage that they can be obtained without limiting the number and length of posts. We extract the features by clustering distances between blog data or replies. These features are obtained from MAC-CDFA (Minimum Average Cluster from Clustering Distance between Frequent words and Articles), which has not been used in any previous social media research. This feature captures similarity between blog data by clustering distances between them. Based on the enriched features, we enhanced the classification quality. Comparing F1-scores, the features we created outperformed the features used for bot detection on Facebook and Twitter.

Related Work
Detecting bots on social media platforms has become an important issue with the growth of social media platforms (Allcott and Gentzkow 2017; Ferrara et al. 2016). Many researchers have tried to detect bots with machine learning algorithms (Abu-El-Rub and Mueen 2019; Chu et al. 2012; Clark et al. 2016; Dickerson, Kagan, and Subrahmanian 2014; Santia, Mujib, and Williams 2019; Varol et al. 2017; Wang 2010). They have focused on extracting features that represent patterns of behavior of each account. In particular, features that represent regularity have played a major role. Some researchers computed the similarity of texts posted by each user, and others measured the entropy of time intervals to express the regularity of behavior patterns. In a recent study, (Li and Palanisamy 2019) pointed out the prevalence on Steemit of a different type of bot, from which users buy votes. Most of these bots are easily found from their own advertisements or by examining transfer memos that contain the post URL to be upvoted. In contrast, our focus is posting bots.

There have been several attempts to extract the regularity of texts, especially around Twitter (Abu-El-Rub and Mueen 2019; Clark et al. 2016; Wang 2010). They used methods that consider all pairs of tweets, defining a similarity between two tweets. Similarity between two tweets is defined in many ways. (Wang 2010) identified whether a tweet is duplicated by another, using the Levenshtein distance. (Abu-El-Rub and Mueen 2019) defined the similarity using the Jaccard index of hashtags contained in each text. (Clark et al. 2016) considered the longest common sequence of two texts. However, Twitter is different from Steemit in terms of text length.

In this respect, Facebook is a good example to compare with Steemit. Users of both Steemit and Facebook can write long articles. (Santia, Mujib, and Williams 2019) tried to
detect social bots on Facebook with six features including content-based features. However, they did not extract features that represent pairwise text similarity. Rather, they computed the innovation rate, which represents the vocabulary of each account and is also used on a Twitter dataset (Clark et al. 2016). In addition, they proposed six features used on a Facebook dataset, including the innovation rate.

Features related to behavioral regularity also give important information. (Chu et al. 2012) proposed an automated account detection algorithm on Twitter, which measures the entropy of tweeting time intervals. Showing the difference in the distributions of the entropy for each type of account, they emphasized that entropy measures are important for detecting automated accounts. In addition, (Chavoshi, Hamooni, and Mueen 2017) emphasized that it is important to consider temporal data to detect Twitter bots. In this regard, we compute the entropy from the sequence of various activities of an account, including transfers.

Other studies have tried to extract various types of features. (Dickerson, Kagan, and Subrahmanian 2014) used sentiment scores to design a social bot classifier by applying the Random Forest algorithm, using several features including a semantic metric. Feature importance extracted from the Random Forest algorithm revealed that semantic metrics play an important role in detecting social bots on Twitter. (Varol et al. 2017) extracted six different types of features: metadata of users and friends, tweet content and sentiment, network patterns, and activity time series. They highlighted the fact that human and bot accounts have diverse behavior patterns and concluded that 8-15 percent of accounts on Twitter are social bots.

There are different approaches to detecting bots on social media platforms. (Cresci et al. 2017; Feng et al. 2017; Lee and Kim 2014) defined different types of similarities used to detect social bots. (Cresci et al.
2017) defined the Digital DNA, which is the sequence of behaviors of a user, and computed the similarity of two sequences. (Lee and Kim 2014) considered the similarity of user names. In addition, (Feng et al. 2017) defined the similarity of users' relationships. Both applied their similarity to a hierarchical clustering method. (Chavoshi, Hamooni, and Mueen 2016) is a model that detects Twitter bots based on a clustering method. They used the lag-sensitive hashing technique and computed the Pearson correlation of posting time series between each pair of users. Some applied anomaly detection methods (Castellini, Poggioni, and Sorbi 2017; Minnich et al. 2017). (Castellini, Poggioni, and Sorbi 2017) extracted features and applied them to a denoising autoencoder, a deep learning algorithm, and (Minnich et al. 2017) employed an ensemble method of anomaly detection. Some studies are based on the graph structure (Cao et al. 2012; Wang, Zhang, and Gong 2017). Their approaches are based on the Random Walk (Cao et al. 2012) or Loopy Belief Propagation (Wang, Zhang, and Gong 2017). (Boshmaf et al. 2015; El-Mawass, Honeine, and Vercouter 2018; Höner et al. 2017) mixed machine learning algorithms with graph-based methods. They defined the similarity between users (El-Mawass, Honeine, and Vercouter 2018) or adjusted each edge weight using a machine learning method (Boshmaf et al. 2015; Höner et al. 2017). For other studies on Steemit, see (Thelwall 2018; Casadesus-Masanell, White, and Elterman 2019; Jeong 2020).
Feature Generation
In this section, we describe the features used in the classification. The features are divided into four categories. First, we develop the CDFA group, which describes the distance between frequently used words and articles. Second, (Santia, Mujib, and Williams 2019) analyzed social bots on Facebook. Unlike Twitter, one can write a blog post on Facebook with unlimited characters, similar to Steemit. Therefore, we benchmark the features in (Santia, Mujib, and Williams 2019) and call them the Santia-2019 group. Third, (Chu et al. 2012) classified accounts on Twitter into human, bot, and cyborg using the entropy rate, spam detection, and account properties. However, some of these features are not available or not meaningful for detecting posting bots on Steemit. For example, the type of tweeting device and account verification features are not available on Steemit, and spam detection is not meaningful because, out of fourteen spammers, only two appear to be posting bots and the remaining twelve are humans. Consequently, we benchmark the entropy rate and some of the account properties from (Chu et al. 2012), and we denote them as the Chu-2012 group. Finally, we added more features related to the blockchain in order to observe the relation between the blockchain and posting bots. We denote them as the blockchain-oriented feature group. We aim to study the difference in posting bot detection performance between the features we created and the features from previous studies (Chu-2012, Santia-2019). We also want to observe the difference in performance between the feature sets with and without the blockchain-oriented features.
CDFA Group
We introduce new features called the CDFA group to represent the characteristics of the words of a given account. To develop the new features, we introduce a clustering method that considers the similarity between articles. Clustering Distance between Frequent words and Articles (CDFA) is a method that transforms word data into real values. We consider the frequent words used by an account and measure the distance between the frequent words and the articles written by the account.

For given data of $m$ articles written by an account, to extract the frequent words, we split the articles into words by whitespace. Let $W_j$ be the set of words in the $j$-th article, $W = \bigcup_{1 \le j \le m} W_j$ be the set of all words used in the articles, and $w_i$, $1 \le i \le n$, be the words in $W$. Further, for each article, we determine whether a word $w_i$ is used or not. Then we obtain the occurrence vectors $V_j$, $1 \le j \le m$, of length $n$, in which each element of $V_j$ represents the word occurrence in the $j$-th article:

$V_j[i] = \begin{cases} 1, & w_i \in W_j \\ 0, & w_i \notin W_j \end{cases}$

Next, we sum up the occurrence vectors and obtain a total occurrence vector $T$; the value of each element of $T$ is the number of articles in which the corresponding word appears. Words whose occurrence value is 10% or more of the maximum value in $T$ are defined as the frequent words $F$, and we obtain a vector $V_{freq}$ of length $n$ that has value 1 on the frequent words and 0 otherwise:

$V_{freq}[i] = \begin{cases} 1, & w_i \in F \\ 0, & w_i \notin F \end{cases}$
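The frequent-word extraction above can be sketched in a few lines of Python (a minimal sketch; the function and variable names are ours, and we split on whitespace as in the text):

```python
from collections import Counter

def frequent_words(articles):
    """Return the set F of frequent words: words whose article count
    (the corresponding entry of the total occurrence vector T) is at
    least 10% of the maximum entry in T."""
    word_sets = [set(article.split()) for article in articles]  # W_j
    total = Counter()                                           # T
    for words in word_sets:
        total.update(words)                                     # sums the 0/1 vectors V_j
    threshold = 0.1 * max(total.values())
    return {w for w, count in total.items() if count >= threshold}
```

The occurrence vectors $V_j$ are implicit here: each `word_sets[j]` plays the role of the 0/1 vector restricted to its nonzero entries.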
After we determine the vector $V_{freq}$, we compute the Euclidean distance between $V_{freq}$ and $V_j$ and define this distance as $d_j$. We cluster the $m$ distances with the Dirichlet Process Gaussian Mixture Model introduced in (Rasmussen 2000). In detail, the maximum number of clusters is set to five. Note that some of the clusters may not contain enough data, and such clusters are not appropriate to represent writing patterns. In this case, we keep only the clusters whose sizes are at least $m/5$, where the denominator comes from the maximum number of clusters.

Using CDFA, we obtain the clusters of distances. Among them, we choose the cluster that has the minimum average of the distances. Because posting bots tend to write articles with a fixed form, words in the form would be among the frequent words, and the distances between the frequent words and articles with fixed forms would be small. Therefore, we select the cluster with the minimum average distance and denote it as the MAC-CDFA. From the MAC-CDFA, we extract the mean and variance of the MAC-CDFA, and the number of clusters that have size at least $m/5$.

Figure 1 shows the procedure of CDFA in brief. From the articles, we extract frequent words and calculate the distance between the frequent words and the articles. After that, we cluster the distances and choose the cluster that has the least mean among the clusters.

CDFA can be applied to various datasets. For a blog, there are the title, content, and replies. We applied CDFA to the titles, content, and replies written by an account, and we denote these as CDFA-T, CDFA-C, and CDFA-R, respectively. Similarly, we define MAC-CDFA-T, MAC-CDFA-C, and MAC-CDFA-R. For each MAC-CDFA, we extract three features, thus there are nine features from the CDFA. We call these nine features the CDFA feature group (or simply CDFA features). For accounts with less than five blogs, the mean and variance of MAC-CDFA-T and MAC-CDFA-C are set to 0.
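The cluster filtering and MAC-CDFA selection described above can be sketched as follows (a minimal sketch; the function name is ours, and the cluster labels are assumed to come from a Dirichlet Process Gaussian Mixture fitted elsewhere, e.g., with scikit-learn's `BayesianGaussianMixture`):

```python
def mac_cdfa_features(distances, labels, m):
    """From per-article distances d_j and their cluster labels, keep
    clusters of size at least m/5, pick the cluster with the minimum
    average distance (the MAC-CDFA), and return its mean, its
    variance, and the number of kept clusters."""
    clusters = {}
    for d, label in zip(distances, labels):
        clusters.setdefault(label, []).append(d)
    kept = [c for c in clusters.values() if len(c) >= m / 5]
    if not kept:  # too few articles to form a meaningful cluster
        return 0.0, 0.0, 0
    mac = min(kept, key=lambda c: sum(c) / len(c))
    mean = sum(mac) / len(mac)
    variance = sum((d - mean) ** 2 for d in mac) / len(mac)
    return mean, variance, len(kept)
```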
In addition, the number of clusters via CDFA-T and CDFA-C is set to 0. Similarly, for accounts with less than five replies, the mean and variance of MAC-CDFA-R and the number of clusters via CDFA-R are set to 0.

Figure 2 shows distributions of features in the CDFA group. The top three graphs represent the mean of MAC-CDFA, the three graphs in the middle represent the variance of MAC-CDFA, and the bottom three graphs represent the number of clusters via CDFA. To observe meaningful data, we make histograms using accounts with five or more blogs for the left six graphs, and accounts with five or more replies for the right three graphs, because the excluded accounts have the value of 0.

Figure 1: CDFA and MAC-CDFA

Index  Feature Name
1      Average of MAC-CDFA-T
2      Variance of MAC-CDFA-T
3      Number of clusters in CDFA-T
4      Average of MAC-CDFA-C
5      Variance of MAC-CDFA-C
6      Number of clusters in CDFA-C
7      Average of MAC-CDFA-R
8      Variance of MAC-CDFA-R
9      Number of clusters in CDFA-R

Table 1: Features in CDFA Group

In addition to our features, there are numerous other features related to text similarity and natural language processing. In this section, we compare the CDFA features with other text-related features: (i) frequent word counts, (ii) term frequency–inverse document frequency (TF–IDF), (iii) Levenshtein edit distance, and (iv) word embedding.

We obtained the frequent word count in the CDFA process. One of the standard methods to analyze text content is TF–IDF, which assigns a weight to each word in a document. In general, in TF–IDF, common words are allotted low weights, whereas uncommon ones have high weights. However, TF–IDF does not represent the features of an account. Assuming that an account frequently uses a word whereas other accounts scarcely employ it, the TF–IDF weight of this word will be higher than those of the other words employed in the content written by that account. However, the TF–IDF weight of a word will be relatively low if the word is also found in other documents.
We calculated the TF–IDF weights of the words that occurred ten times or more.

The Levenshtein edit distance is one of the standard approaches to calculate the distance between two texts. Owing to the long computing time of the model, we sampled the accounts that wrote less than 500 posts and less than 500 replies. Moreover, for a given account, we divided the data into titles of blogs, content of blogs, and replies, and calculated the mean of the pairwise distances within each categorized dataset.

Word embedding is widely used in natural language processing. To conduct the word embedding, we collected pre-trained datasets in English, German, Spanish, Korean, French, and Russian from FastText (Joulin et al. 2016). In a pre-trained dataset, each word is assigned a 300-dimensional vector. We denote the set of words in the pre-trained dataset as a bag of words. For a given text, we split it into words and calculate the average of the vectors corresponding to the words that are in the bag of words. We denote this average as a text vector. For each account, we calculate a text vector for each blog content or reply, obtain an account vector as the average of the text vectors, and use the account vector as a feature. We sampled accounts for which the number of words contained in both the bag of words and the words used by the account is 500 or more.

In Table 2, we compare all the text-related features to the CDFA features. We used the Random Forest classifier with Gini index in the classification. The left scores in the table are the F1-scores of the features, and the right scores are the corresponding F1-scores of the CDFA features. Because the account set varies, the corresponding scores of the CDFA features also vary. We observe that the CDFA features outperform the other features.

Features                   Score  Score of CDFA
Frequent word counting     62.78  83.72
TF-IDF                     79.38  83.72
Levenshtein edit distance  73.01  83.22
Word embedding             59.05  78.14

Table 2: Comparison between text-related features and CDFA

Santia-2019 Group
In the Santia-2019 group, there are six features: average response time, average comment length, innovation rate, maximum daily comments, number of links, and thread deviation.
Average Response Time
Steemit users can leave comments on a blog or leave replies to the comments left on the blog. Moreover, users can leave replies to the replies. We introduce a depth of comments to explain this process. Blogs in Steemit are comments of depth 0. Comments left on blogs are comments of depth 1. If you leave a reply on a comment of depth n, then your reply is of depth n+1, and the comment of depth n you replied to is the parent reply of your reply. Response time measures how long after the previous reply each reply was created. Here, the previous reply means the reply written just before the given reply among replies sharing the same parent reply. If a reply is the first among them, we compute the time difference from the parent reply. Then, we obtain the response time of each reply. Given a user, the average response time is the average of the response times over all replies written by the user.

Figure 2: Distribution of Features in CDFA Group

Average Comment Length
We generate features related to blogs and replies. One of them is the average comment length. As in (Santia, Mujib, and Williams 2019), we note that some posting bots generate blog content or replies that are long. We generate the average comment length by averaging the lengths of all blog content and replies written by an account.
Innovation Rate
One of the criteria for distinguishing humans and bots is the diversity of words. To measure the diversity of words, (Santia, Mujib, and Williams 2019) used the innovation rate, which represents the decay rate of the diversity of words. To detect automation on Twitter, (Clark et al. 2016) used the word introduction decay rate α(n). In our case, the whole procedure is the same except for the shuffling. Because there are many blogs and replies for some bots, we shuffled the words based on the articles. That is, for m articles, we shuffle the order of the articles, split them by whitespace, and make the sequence of the words. Further, we shuffle three times to obtain the innovation rate.

Maximum Daily Comments
Unlike ordinary accounts, bots can write many articles in a day using automated programs. To deal with bots that generate a massive number of blogs or replies in a short period, we extract the maximum daily comments.
Number of Links
We use a regular expression to extract the strings that contain http or https. Even with a regular expression, some extracted strings may not be valid URLs, so a URL validator is used to filter them out.
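A minimal version of this extraction might look like the following (the regular expression and the name `number_of_links` are our own simplified stand-ins; the validation step is only hinted at):

```python
import re

# Candidate URLs: an http(s) scheme followed by non-whitespace characters.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>)]+")

def number_of_links(text):
    """Count http/https link candidates in a post; in practice a URL
    validator then filters out matches that are not well-formed URLs."""
    return len(URL_PATTERN.findall(text))
```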
Thread Deviation
This feature represents the regularity in a user's response patterns. We compute the response times of all replies left on a blog. Next, we compute the average response time corresponding to the blog. Then, for each reply left on the blog, we calculate the difference between the response time of the reply and the average response time of the blog; this difference is called the deviation. Finally, the thread deviation of a user is defined as the average of the deviations of the replies written by the user.

Index  Feature Name
10     Average Response Time
11     Average Comment Length
12     Innovation Rate
13     Maximum Daily Comments
14     Number of Links
15     Thread Deviation

Table 3: Features in Santia-2019 Group
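One way to compute the thread deviation described above is the following sketch (taking the absolute difference is our assumption; the signed difference could also be averaged, and the function name is ours):

```python
def thread_deviation(user_replies, blog_avg_response_time):
    """Average deviation of a user's reply response times from the
    average response time of the blog each reply was left on.
    user_replies: list of (blog_id, response_time) pairs.
    blog_avg_response_time: dict mapping blog_id to its average."""
    deviations = [abs(rt - blog_avg_response_time[blog])
                  for blog, rt in user_replies]
    return sum(deviations) / len(deviations) if deviations else 0.0
```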
Chu-2012 Group
In the Chu-2012 group, there are six features: entropy rate, hashtag ratio, mention ratio, URL ratio, FF ratio, and the age of an account.
Entropy rate
The entropy rate $\bar{H}(X)$ of an infinite random process $X = \{X_i\}$ is the conditional entropy

$\bar{H}(X) = \lim_{n \to \infty} H(X_n \mid X_{n-1}, \ldots, X_1)$,

where the conditional entropy is computed as

$H(X_n \mid X_{n-1}, \ldots, X_1) = H(X_1, \ldots, X_n) - H(X_1, \ldots, X_{n-1})$,

and the entropy of a sequence of random variables is defined as

$H(X_1, \ldots, X_n) = -\sum_{x_1, \ldots, x_n} P(X_1 = x_1, \ldots, X_n = x_n) \log P(X_1 = x_1, \ldots, X_n = x_n)$.

We refer to the above equation as the entropy formula. Because real data sets are finite, (Chu et al. 2012) used a corrected conditional entropy, denoted CCE, to estimate the entropy rate. First, they derived the joint probabilities $P(X_1 = x_1, \ldots, X_n = x_n)$ empirically. Then, they computed the conditional entropy based on the empirically derived joint probabilities; this conditional entropy is denoted CE. Then, they added a corrective term $per(X_n) \cdot EN(X_1)$, where $per(X_n)$ is the percentage of unique sequences of length $n$, and $EN(X_1)$ is the entropy of $X_1$:

$CCE(X_n \mid X_{n-1}, \ldots, X_1) = CE(X_n \mid X_{n-1}, \ldots, X_1) + per(X_n) \cdot EN(X_1)$.

They determined the $n$ that minimizes CCE, and computed the entropy rate of the sequence of tweeting intervals of each user. We measured the entropy rate of the sequence of comment time intervals and of the time differences between comment actions.
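A compact estimator along these lines might look like the following (a sketch using natural logarithms; Chu et al.'s exact implementation details, such as the binning of time intervals, are omitted, and the function names are ours):

```python
import math
from collections import Counter

def block_entropy(seq, n):
    """Empirical entropy of the n-grams of seq."""
    grams = [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]
    counts = Counter(grams)
    total = len(grams)
    return -sum(c / total * math.log(c / total) for c in counts.values())

def cce(seq, n):
    """Corrected conditional entropy: CE(n) plus per(n) * EN(X_1),
    where per(n) is the fraction of unique n-grams."""
    ce = block_entropy(seq, n) - block_entropy(seq, n - 1)
    grams = Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))
    per = sum(1 for c in grams.values() if c == 1) / sum(grams.values())
    return ce + per * block_entropy(seq, 1)

def entropy_rate(seq, max_n=5):
    """Estimate the entropy rate as the minimum CCE over n."""
    return min(cce(seq, n) for n in range(2, max_n + 1))
```

A perfectly regular sequence (e.g., a bot commenting at fixed intervals) yields an entropy rate near zero, while an irregular sequence scores much higher.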
Account Properties
As we mentioned at the beginning of the feature generation section, only some of these features are available on Steemit. In the case of blogs, the tag data contains the tags of blogs, which represent the main topics of the blogs. Thus, we use a regular expression to extract the hashtags. After obtaining the hashtags, we calculate the hashtag ratio by dividing the number of blogs and replies that contain hashtags by the total number of blogs and replies. We extract the mention ratio in a similar way to the hashtag ratio. In the case of the
URL ratio, we calculate it via the processed data used to extract the number of links, using an approach similar to the hashtag ratio.

Index  Feature Name
16     Entropy rate
17     Hashtag ratio
18     Mention ratio
19     URL ratio
20     FF ratio
21     The age of an account

Table 4: Features in Chu-2012 Group

Next, using the follower and following data, we calculate the
FF ratio. We obtain the FF ratio by dividing the number of followers by the sum of the numbers of followers and followings. If the numbers of followers and followings are both 0, the FF ratio is 0. Finally, the age of an account is the difference between the time the account was created and the end time of the dataset.
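The FF ratio computation is simple enough to state directly (the function name is ours):

```python
def ff_ratio(followers, followings):
    """Followers / (followers + followings), defined as 0 when an
    account has neither followers nor followings."""
    total = followers + followings
    return followers / total if total > 0 else 0.0
```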
Blockchain-Oriented Feature Group
Based on the blockchain system, we added 12 features, which are listed in Table 5; we term them blockchain-oriented (or simply blockchain) features. In this section, we introduce the blockchain features.
Number of transfers is the total number of transfers; Daily time entropy of transfer is the entropy obtained by setting the transfers on each day as the random variable in the entropy formula; Transfer activation time is the interval between the first and last transfer times; Daily transfer is obtained by dividing the number of transfers by the transfer activation time; In-degree of transfer of an account is the number of accounts that transferred to the account; Out-degree of transfer of an account is the number of accounts that the account transferred to; Entropy of the in-degree accounts of an account is the entropy obtained by setting the accounts that transferred to the account as the random variable in the entropy formula; Entropy of the out-degree accounts of an account is the entropy obtained by setting the accounts that the account transferred to as the random variable in the entropy formula; and Steem-created account indicates whether the account was created by Steem.

To obtain Average transfer per blog or reply, we calculate the number of blogs or replies and the number of transfers on each day. Subsequently, we divide the number of transfers by the number of blogs or replies on each day and obtain the feature by taking the average over days. Average transfer per blog and Average transfer per reply are obtained similarly.
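Our reading of these per-day averages can be sketched as follows (a sketch under our assumptions; the inputs are lists of the days on which each transfer and each post occurred, and the function name is ours):

```python
from collections import Counter

def average_transfer_per_post(transfer_days, post_days):
    """Average, over days with at least one post, of the number of
    transfers divided by the number of posts on that day. Passing
    blog days, reply days, or both yields the three feature variants."""
    transfers = Counter(transfer_days)
    posts = Counter(post_days)
    ratios = [transfers[day] / count for day, count in posts.items()]
    return sum(ratios) / len(ratios) if ratios else 0.0
```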
Posting bot classification
We explain how we classify the posting bots. First, we introduce the dataset. Second, we clarify the annotation process. Finally, we describe the classification procedure using several classifiers.
Dataset
Steemit is a social media platform based on the Steem blockchain. The Steem blockchain is a public blockchain; therefore, all the data are publicly available. Using the data from February 2019 to December 2019, we manually classified humans and bots. A total of 984 accounts were divided into 325 bot accounts and 659 human accounts. For sampling, we collected the users who wrote blogs or replies 40 or more times. We describe the detailed labeling process in the annotation section.

Figure 3: F1-score comparison

Index  Feature Name
22     Number of transfers
23     Daily time entropy of transfer
24     Transfer activation time
25     Daily transfer
26     In-degree of transfer
27     Out-degree of transfer
28     Entropy of the in-degree accounts
29     Entropy of the out-degree accounts
30     Steem-created account
31     Average transfer per blog or reply
32     Average transfer per blog
33     Average transfer per reply

Table 5: Features in the Blockchain-Oriented Feature Group
Annotation
Because Steem is less explored in previous studies, annotation is one of the challenging tasks in our research. Two annotators participated in the annotation, which consists of two stages. In the first stage, both annotators label the accounts independently using the same dataset. Table 6 summarizes the results of the first stage. Subsequently, in the second stage, the annotators compare and discuss their labels. Remark that the Cohen's Kappa value is 90.23. (Note that some types of data are not stored on this blockchain. First, any media, e.g., pictures and videos, is stored in a typical centralized cloud service. Second, certain activities, e.g., login, logout, and read, that are not publicly available on steemit.com are not stored on the blockchain.) Some accounts write several posts or replies like humans, but they
are suspicious of using an automated program in some posts or replies. We denote them as semi-automated accounts. The annotators analyzed that the Cohen's Kappa value is high because the subjective opinions of the two annotators coincide in labeling semi-automated accounts as bots. The annotators agreed to establish criteria for labeling to deal with the semi-automated accounts and to resolve the disagreement of 4.17% over all accounts.

                     Annotator 1
                     Bot    Human
Annotator 2   Bot    283       26
              Human   15      660

Table 6: Annotation in the First Stage. The Cohen's Kappa value is 90.23.

Basic criteria for labeling bots were established from the general characteristics of the blog and reply data of the accounts labeled as bots by both annotators. To set the basic criteria, we define the form: multiple texts have the same form when only the numbers, accounts, and links in the texts change and the rest is the same. For example, some accounts have the form of "You got a [n]% upvote from [account]." In this case, the changes are only in the number n and/or the account part. In addition, some accounts have the form of a table with a ranking of accounts according to some criteria; in this scenario, changes occur only in the accounts in the table. Moreover, if an account has several forms and writes repeatedly using them, the account is labeled as a bot. For example, when an account runs a gambling app, it is necessary to set forms such as the winner, the amount won, the amount that can be bet, and the remaining funds for gambling. In summary, our basic criterion is that a bot has a certain form or forms in blogs or replies and writes ten or more times in a row using the form or forms. If the blog content or replies of an account match the basic criterion, the account is labeled as a bot.

However, there are accounts to which our basic criterion may not be adequately applied. Therefore, we establish some exceptions.
An account that satisfies one of the following cases is labeled as a human: (i) leaving replies to participate in an event or to use a service that a bot cannot easily participate in or use, (ii) writing a link related to a game-play live streaming and having ten or more replies that do not satisfy the basic criterion, (iii) reporting one's workout records using a workout app that has its own abuse detection system, (iv) posting a personal game app status and having ten or more replies that do not satisfy the basic criterion, and (v) posting pictures and having ten or more replies that do not satisfy the basic criterion. In contrast, an account that satisfies one of the following cases is labeled as a bot: (i) copying news and having fewer than ten replies; (ii) randomly rearranging short sentences.

The created CDFA features focus on text similarity, and in the labeling process the basic criteria are also related to text similarity. However, our labeling does not entirely depend on text similarity: it also considers the opinions and experiences of users. In addition, the labeling process covers news-copying accounts and other types of bots that do not depend on text similarity.

Classification Procedure
For the classification, we used several classifiers. In (Santia, Mujib, and Williams 2019) and (Chu et al. 2012), Random Forest classifiers with the Gini index and with entropy (Breiman 2001), the Linear Support Vector classifier (Cortes and Vapnik 1995), and the Decision Tree classifier (Breiman 2017) are used to detect bots. In addition to them, we also used classifiers based on boosting algorithms, namely XGBoost (Chen and Guestrin 2016), LightGBM (Ke et al. 2017), and AdaBoost (Freund, Schapire, and Abe 1999). Also, we applied the multi-layer perceptron (MLP) classifier (Windeatt 2006) as a representative neural network. We denote the Random Forest classifier with entropy as RF-E, the Random Forest classifier with the Gini index as RF-G, the Linear Support Vector classifier as LSVC, the XGBoost classifier as XGB, the Decision Tree classifier as DTC, and LightGBM as LGBM.

For the scores, we used the four traditional measurements:
Accuracy, Precision, Recall, and F-score. To generate the results, we performed five-fold cross-validation. First, we shuffled our dataset and divided it equally into five sets. Next, we chose the first set as the test set, with the remainder becoming the training set. On the training set, we optimized the hyperparameters of each classification algorithm using a grid search with five-fold cross-validation to obtain a high F-score. Applying the optimized hyperparameters to the classification algorithms, we obtained the models, fit them to the test set, and acquired the results. From the divided sets, we can choose five different test sets. Repeating the above procedure, we obtained five different results for each model and computed the final result for each model by taking the average.

Results and Discussion
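The classification procedure above (grid search inside an outer five-fold cross-validation, with the F-score as the objective) can be sketched with scikit-learn. The data are synthetic stand-ins and the hyperparameter grid is illustrative, not the grid used in the experiments:

```python
# Nested cross-validation sketch: outer five-fold evaluation, inner
# five-fold grid search tuned for F-score on each training portion.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, GridSearchCV
from sklearn.metrics import f1_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_idx, test_idx in outer.split(X, y):
    search = GridSearchCV(
        RandomForestClassifier(criterion="entropy", random_state=0),
        param_grid={"n_estimators": [50, 100], "max_depth": [None, 10]},
        scoring="f1", cv=5)                       # inner grid search
    search.fit(X[train_idx], y[train_idx])
    scores.append(f1_score(y[test_idx], search.predict(X[test_idx])))

print(sum(scores) / len(scores))  # final F-score: average over the five test folds
```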
We compare the results obtained by employing the four feature groups and observe that the CDFA group outperforms the others. In addition, we derive the results corresponding to the presence and absence of the blockchain features to ensure that blockchain-oriented features are effective. To interpret the results, we consider the feature importance of each model and rank the features accordingly. For highly ranked features, we further analyze their characteristics.
Results
Note that we categorized the features into four feature groups: CDFA, Santia-2019, Chu-2012, and blockchain features. For convenience, Santia-2019 is denoted as S and Chu-2012 as C. Table 7 shows the results when only one of the four feature groups is applied, as well as the results when blockchain-oriented features are excluded and included. In Table 7, we highlight the best scores among the four feature groups for each classifier, and the best scores among the classifiers for the cases of including and excluding blockchain-oriented features, respectively. We observe that the CDFA group classifies posting bots better than the other feature groups. In addition, including blockchain-oriented features is more effective in detecting posting bots, except for the Linear Support Vector classifier, the Decision Tree classifier, and the XGBoost classifier. Finally, as we see in Table 7, the Random Forest classifier with entropy gives the best Accuracy and F-score, the Random Forest classifier with the Gini index gives the best Precision, and the AdaBoost classifier gives the best Recall.

Feature Importance
The tree-based ensemble models (Random Forest, Decision Tree, XGBoost, LightGBM, AdaBoost) provide the feature importance. For a tree-based model, the classification is done based on the features in the dataset, and feature importance provides information on how much each feature contributed to improving the scores. In the classification procedure, we obtain six different tables of feature importance from the six classifiers. Because we calculate the F-score by averaging the F-scores of the five different test sets for each model, we also calculate the feature importance by averaging the importance over the five different training sets.

Figure 4 shows the top 15 features by importance for each model; the x axis represents the feature names. We observe that some features in the CDFA group have high importance. As we see in Figure 4, only a qualitative analysis of the feature ranks is possible because of the scattered graphs. To determine the overall ranking, we use the Borda count (Borda 1784), one of the popular election methods. The Borda count converts ranks into relative points and determines the final rank by summing the relative points. There are many methods for converting ranks into relative points; in this study, we use the Dowdall system, which calculates the relative points as the reciprocal of the rank. For the averaged feature importance of each model, we determine the rank with respect to the importance, take the reciprocals of the ranks, and sum them up. Finally, we obtain the sum of the relative points of each feature and determine the overall rank. Table 8 shows the top five features, which have relative points higher than 1. We observe that three of them are in the CDFA group. The FF ratio and the innovation rate are also in Table 8. We will analyze the features in Table 8 further in the next section.

[Table 7: Accuracy, Precision, Recall, and F-score of each classifier for the feature groups CDFA, Santia-2019, Chu-2012, Blockchain, CDFA + S + C, and All.]
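The Borda count under the Dowdall system can be sketched as follows; the per-classifier rankings are toy stand-ins, not the actual importance tables:

```python
# Aggregate per-model feature rankings with the Borda count under the
# Dowdall system: each feature earns 1/rank points from every model,
# and features are ordered by their summed points.
from collections import defaultdict

rankings = [                       # one ranked list per classifier (toy data)
    ["FF_ratio", "MAC_CDFA_T_mean", "innovation_rate"],
    ["MAC_CDFA_T_mean", "FF_ratio", "innovation_rate"],
    ["MAC_CDFA_T_mean", "innovation_rate", "FF_ratio"],
]

points = defaultdict(float)
for ranking in rankings:
    for rank, feature in enumerate(ranking, start=1):
        points[feature] += 1 / rank        # Dowdall: reciprocal of the rank

final = sorted(points, key=points.get, reverse=True)
print(final)  # → ['MAC_CDFA_T_mean', 'FF_ratio', 'innovation_rate']
```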
Feature Interpretation
In this section, we examine the distributions of the important features shown in Table 8, as well as the highest-ranked feature among the blockchain features. The top left graph in Figure 5 shows the distribution of the mean of MAC-CDFA-T for active users who post blogs five times or more. The users in the graph are normally distributed, and the distribution for posting bots has smaller means. This shows that posting bots tend to post titles with certain forms.

The top two histograms in Figure 5 show the log-scaled distributions of the variance of MAC-CDFA-R and MAC-CDFA-T for active users. We see that the variances of MAC-CDFA-R and MAC-CDFA-T of posting bots are zero more often than those of humans. In contrast, the log-scaled distributions for humans resemble normal distributions. Consequently, we infer that an active user is a posting bot when the variance of MAC-CDFA-T or MAC-CDFA-R is zero.

(Chu et al. 2012) found that automated bots on Twitter follow numerous users, expecting that humans will follow them in return. However, this scenario is reversed on Steemit. The lower left graph in Figure 5 presents the distribution of the FF ratio. We observe that the FF ratios of many posting bots are close to 1. This suggests that most posting bots do not follow other users, whereas humans follow other users actively.

A user who has a limited vocabulary has a high innovation rate, whereas a creative user has a low innovation rate. This is well illustrated in the lower right graph of Figure 5: users with high innovation rates are posting bots, whereas those with low innovation rates are humans.

Figure 4: Top 15 feature importance of the classifiers.
Figure 5: Distributions of important features.

In contrast, the feature with the highest ranking among the blockchain features is the out-degree of transfer, and it is ranked 12th.
Analysis of this feature demonstrated that 39 accounts had an out-degree of transfer of more than 200, of which 92.3% (36 accounts) were bots and 7.7% (three accounts) were humans. In addition, 11.1% of the bots and only 0.5% of the humans had an out-degree of transfer of more than 200. Observation of these accounts suggests that they need to transfer tokens to other accounts, for example to run games and events or to manage their tokens within the Steem blockchain. Therefore, we infer that the out-degree of transfer feature assists in detecting these types of bots.

Overall, we observed that the behaviors of posting bots differ from those of humans in numerous aspects. Based on the CDFA features, we obtain information about the texts close to the representative text structure of each account. In fact, our results demonstrate that using a representative text structure is essential to detect posting bots. Considering the innovation rate, bots produce the same texts with little variation and have restricted vocabularies. In addition, we find an extreme distribution of the FF ratio; from this distribution, we infer that developing relationships with other accounts is not a primary objective of posting bots. Among the blockchain features, the out-degree of transfer is essential in classification, and we detect some bots that transfer a large amount of cryptocurrency for running games or managing their tokens.
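Two of the signals discussed in this section can be combined into a toy screening rule. The FF-ratio formula below (followers divided by followers plus followings) and the 0.99 threshold are assumptions made for illustration; the zero-variance test restates the observation from Figure 5:

```python
# Toy screening rule from two discussed signals: zero variance of the
# per-post MAC-CDFA-T values (identical forms) and an FF ratio near 1
# (the account follows almost no one).
from statistics import pvariance

def ff_ratio(followers, followings):
    total = followers + followings
    return followers / total if total else 0.0

def looks_like_posting_bot(mac_cdfa_t_values, followers, followings):
    zero_variance = pvariance(mac_cdfa_t_values) == 0   # identical forms
    never_follows = ff_ratio(followers, followings) > 0.99
    return zero_variance or never_follows

bot = looks_like_posting_bot([0.12] * 20, followers=800, followings=1)
human = looks_like_posting_bot([0.31, 0.58, 0.44, 0.72, 0.25],
                               followers=120, followings=140)
print(bot, human)  # → True False
```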
Conclusion
The problem of detecting posting bots is essential for providing more rewards to human users and motivating them to generate good content. In this paper, we developed the features in the CDFA group to detect posting bots. The CDFA method finds frequent words in articles and measures the distance between the frequent words and the articles. Note that Steemit users can write blogs or replies without a length limit, as on Facebook. To analyze posting bots, it is necessary to deal with a large number of blogs or replies of unlimited length, because bots can generate many articles in a short period of time. Therefore, we calculate the similarity of articles by transforming the articles into real numbers and using a clustering method that can handle many blogs and replies. With CDFA, we select the MAC-CDFA among the clusters obtained from CDFA and extract features from it. To compare the performance of the features, we benchmark the features introduced in (Santia, Mujib, and Williams 2019) and (Chu et al. 2012) and use the F-score as a comparison measure. The results show that the features in the CDFA group are more effective than the other feature groups. To interpret the results, we calculated the feature importance and its rank and performed a further analysis of the feature distributions.

There is a limitation to our research: in our labeling process, annotation was rarely carried out for languages with which the annotators were not familiar. In future research, we expect new features to be developed that detect other kinds of bots on blockchain-based social media platforms. For example, bid voting bots receive cryptocurrency and upvote posts or replies. Although a list of such bots is available, detecting such bots systematically and using them to improve posting bot detection quality would be of interest. Also, we may be able to improve the results by developing customized CDFA features for each language.
Finally, we expect that the CDFA features will be used to detect posting bots on social media platforms other than Steemit.

Acknowledgement
H. J. Hwang was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (2017R1E1A1A03070105), by the Institute for Information and Communications Technology Promotion (IITP) grant funded by the Korea government (MSIP) (No. 2019-0-01906, Artificial Intelligence Graduate School Program (POSTECH)), and by the ITRC (Information Technology Research Center) support program (IITP-2018-0-01441).
References

[Abu-El-Rub and Mueen 2019] Abu-El-Rub, N., and Mueen, A. 2019. Botcamp: Bot-driven interactions in social campaigns. In The World Wide Web Conference, 2529–2535. ACM.

[Allcott and Gentzkow 2017] Allcott, H., and Gentzkow, M. 2017. Social media and fake news in the 2016 election. Journal of Economic Perspectives.

[Borda 1784] Borda, J.-C. de. 1784. In Histoire de l'Académie Royale des Sciences pour 1781 (Paris, 1784).

[Boshmaf et al. 2015] Boshmaf, Y.; Logothetis, D.; Siganos, G.; Lería, J.; Lorenzo, J.; Ripeanu, M.; and Beznosov, K. 2015. Integro: Leveraging victim prediction for robust fake account detection in OSNs. In NDSS, volume 15, 8–11.

[Breiman 2001] Breiman, L. 2001. Random forests. Machine Learning.

[Breiman 2017] Breiman, L. 2017. Classification and Regression Trees. Routledge.

[Cao et al. 2012] Cao, Q.; Sirivianos, M.; Yang, X.; and Pregueiro, T. 2012. Aiding the detection of fake accounts in large scale social online services. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, 15–15. USENIX Association.

[Casadesus-Masanell, White, and Elterman 2019] Casadesus-Masanell, R.; White, A.; and Elterman, K. 2019. Steemit: A new social media? Harvard Business School Case 720-428.

[Castellini, Poggioni, and Sorbi 2017] Castellini, J.; Poggioni, V.; and Sorbi, G. 2017. Fake Twitter followers detection by denoising autoencoder. In Proceedings of the International Conference on Web Intelligence, 195–202. ACM.

[Chavoshi, Hamooni, and Mueen 2016] Chavoshi, N.; Hamooni, H.; and Mueen, A. 2016. DeBot: Twitter bot detection via warped correlation. In ICDM, 817–822.

[Chavoshi, Hamooni, and Mueen 2017] Chavoshi, N.; Hamooni, H.; and Mueen, A. 2017. Temporal patterns in bot activities. In Proceedings of the 26th International Conference on World Wide Web Companion, 1601–1606. International World Wide Web Conferences Steering Committee.

[Chen and Guestrin 2016] Chen, T., and Guestrin, C. 2016. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 785–794. ACM.

[Chu et al. 2012] Chu, Z.; Gianvecchio, S.; Wang, H.; and Jajodia, S. 2012. Detecting automation of Twitter accounts: Are you a human, bot, or cyborg? IEEE Transactions on Dependable and Secure Computing.

Journal of Computational Science.

[Cortes and Vapnik 1995] Cortes, C., and Vapnik, V. 1995. Support-vector networks. Machine Learning.

IEEE Transactions on Dependable and Secure Computing.

Proceedings of the 2014 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, 620–627. IEEE Press.

[El-Mawass, Honeine, and Vercouter 2018] El-Mawass, N.; Honeine, P.; and Vercouter, L. 2018. Supervised classification of social spammers using a similarity-based Markov random field approach. In Proceedings of the 5th Multidisciplinary International Social Networks Conference, 14. ACM.

[Feng et al. 2017] Feng, B.; Li, Q.; Pan, X.; Zhang, J.; and Guo, D. 2017. GroupFound: An effective approach to detect suspicious accounts in online social networks. International Journal of Distributed Sensor Networks.

Communications of the ACM.

[Freund, Schapire, and Abe 1999] Freund, Y.; Schapire, R.; and Abe, N. 1999. A short introduction to boosting. Journal of the Japanese Society for Artificial Intelligence.

Proceedings of the 34th International Conference on Machine Learning - Volume 70, 1520–1528. JMLR.org.

[Jeong 2020] Jeong, S. E. 2020. Centralized decentralization: Does voting matter? Simple economics of governance attacks on the DPoS blockchain. Available at SSRN: https://ssrn.com/abstract=3575654.

[Joulin et al. 2016] Joulin, A.; Grave, E.; Bojanowski, P.; and Mikolov, T. 2016. Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759.

[Ke et al. 2017] Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; and Liu, T.-Y. 2017. LightGBM: A highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems, 3146–3154.

[Lee and Kim 2014] Lee, S., and Kim, J. 2014. Early filtering of ephemeral malicious accounts on Twitter. Computer Communications.

Proceedings of the 10th ACM Conference on Web Science, 145–154.

[Minnich et al. 2017] Minnich, A.; Chavoshi, N.; Koutra, D.; and Mueen, A. 2017. BotWalk: Efficient adaptive exploration of Twitter bot networks. In Proceedings of the 2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining 2017, 467–474. ACM.

[Rasmussen 2000] Rasmussen, C. E. 2000. The infinite Gaussian mixture model. In Advances in Neural Information Processing Systems, 554–560.

[Santia, Mujib, and Williams 2019] Santia, G. C.; Mujib, M. I.; and Williams, J. R. 2019. Detecting social bots on Facebook in an information veracity context. In Proceedings of the International AAAI Conference on Web and Social Media, volume 13, 463–472.

[Steemit 2017] Steemit. 2017. Steem: An incentivized, blockchain-based, public content platform.

[Thelwall 2018] Thelwall, M. 2018. Can social news websites pay for content and curation? The Steemit cryptocurrency model. Journal of Information Science.

Eleventh International AAAI Conference on Web and Social Media.

[Wang, Zhang, and Gong 2017] Wang, B.; Zhang, L.; and Gong, N. Z. 2017. SybilSCAR: Sybil detection in online social networks via local rule based propagation. In IEEE INFOCOM 2017 - IEEE Conference on Computer Communications, 1–9. IEEE.

[Wang 2010] Wang, A. H. 2010. Detecting spam bots in online social networking sites: A machine learning approach. In IFIP Annual Conference on Data and Applications Security and Privacy, 335–342. Springer.

[Windeatt 2006] Windeatt, T. 2006. Accuracy/diversity and ensemble MLP classifier design.