ENWalk: Learning Network Features for Spam Detection in Twitter

Abstract

Social media are increasing their influence, and the vast amount of public information they carry has led companies and organizations to use them actively for marketing. Such marketing promotions are difficult to identify, unlike those in traditional media such as TV and newspapers. It is therefore important to identify the promoters in social media. Although there is active ongoing research, existing approaches are far from solving the problem. To identify such imposters, it is important to understand their strategies of social circle creation and their dynamics of content posting. Are there any specific spammer types? How successful is each type? We analyze these questions in the light of social relationships in Twitter. Our analyses discover two types of spammers and their relationships with the dynamics of content posts. Our results reveal novel dynamics of spamming which are intuitive and arguable. We propose ENWalk, a framework to detect spammers by learning feature representations of the users in social media. We learn the feature representations using random walks biased on the spam dynamics. Experimental results on a large-scale Twitter network and the corresponding tweets show the effectiveness of our approach, which outperforms the existing approaches.



Santosh K C, Suman Kalyan Maity, Arjun Mukherjee
University of Houston, IIT Kharagpur


Keywords: Social Network; Spam Detection; Feature Learning

Introduction

Social media are increasing their influence tremendously. Twitter is one of the popular platforms where people post information in the form of tweets and share those tweets. Twitter is accessible to everyone through a wide range of web-enabled services, so a real-time reflection of society can be viewed on Twitter. Celebrities, governments, politicians and businesses are active on Twitter to provide their updates and to listen to the views of the people. Thus, the bidirectional flow of information is high. The openness of online platforms and their reliance on users make it easy for spammers to penetrate the platform and overwhelm users with malicious intent and content. This work attempts to detect spammers in social networks using a case study of Twitter.

Spammers in social networks constantly adapt to avoid detection. Moreover, they exploit reflexive reciprocity [6, 17] (users following back those who follow them, as a courtesy) to establish social influence and appear normal. So, it is becoming difficult for traditional spam detection methods to detect the spammers. Such spammers have widespread impact. There are several reports of armies of fake Twitter accounts being used to troll and promote political agendas. Even US President Donald Trump has been accused of having fake followers.

In this paper, we present ENWalk, a framework that uses the content information to bias a random walk of the network and obtain the latent feature embedding of the nodes in the network. ENWalk generates the biased random walks and uses them to maximize the likelihood of obtaining similar nodes in the neighborhood of the network. We study the Twitter content dynamics that could be important to bias those random walks. We found that there are two types of spammers: follow-flood and vigilant. We found that success rate, activity window, fraudulence and mentioning behaviors can be used to compare the equivalence of users in Twitter. We calculate the network equivalence between pairs of nodes using these four behavioral features and bias the random walks with the interaction proximity of each pair of nodes. Experimental results on a 17 million user network from Twitter show that the combination of behavioral features with the underlying network structure significantly outperforms the existing state-of-the-art approaches for deception detection.

Related Work

There have been several works on spam detection in general, especially review spam [7] and opinion spam. However, there have been limited attempts on Twitter. One of the earliest works was done by Benevenuto et al. [1]. They manually labeled data and trained a traditional classifier using features extracted from user contents and behaviors. Lee et al. leveraged profile-based features and deployed social honeypots to detect new social spammers [9]. Stringhini et al. also studied spam detection using honey profiles [14]. Ghosh et al. studied the problem of link farming in Twitter [4] and introduced a ranking methodology to penalize the link farmers. Abuse of online social networks was studied in [16]. Campaign spam was studied in [3, 10, 19].

The Skip-gram model [12] has been popular for learning features from a large corpus of data. It inspired an analogy for networks in which a network is represented as a “document”: just as a document is an ordered sequence of words, an ordered sequence of nodes can be created from a network using sampling techniques. DeepWalk [13] learns d-dimensional feature representations by simulating uniform random walks. LINE [15] learns the d-dimensional features in two phases: d/2 BFS-style simulations and another d/2 2-hop distant nodes. Node2vec [5] creates the ordered sequence by simulating the BFS and DFS approaches. None of these feature learning approaches uses the data associated with the nodes, which is important for learning the behaviors of the nodes.

http://theatln.tc/2m8g3eA http://bzfd.it/2m8rlja http://bit.ly/2kJiMKu http://bit.ly/1ViorHd, http://53eig.ht/2kzrhfL

Dataset

For this work, we use the Twitter dataset used in [18]. It contains 17 million users with 467 million Twitter posts covering a seven-month period from June 1, 2009 to December 31, 2009. To extract the network graph for those 17 million users, we extracted the follower-following topology of Twitter from [8], which contains all Twitter user profiles and their social relationships up to July 2009. We kept only the users that have social relationships in [8] and tweets in [18], which left 4,405,698 users. Twitter suspends the accounts involved in malicious activity (https://support.twitter.com/articles/18311). To obtain the suspension status of the accounts, we re-crawled the profile pages of all the 17 million users. This yielded a total of 100,758 accounts that had been suspended (the profile page redirects to the page https://twitter.com/account/suspended).

We use this suspension signal as the primary signal for evaluating our models, as the primary reason for account suspension is involvement in spam activity. However, there might be other reasons, like inactivity. So, to ensure that the suspended accounts are spammers, we further checked those users for malicious activities. For this, we examined various URLs from each account’s timeline and checked them against a list of blacklisted URLs. We use three blacklists: Google Safebrowsing (http://code.google.com/apis/safebrowsing/), URIBL (http://uribl.com/) and Joewein. We found that 75% of the suspended accounts posted at least one blacklisted shortened URL. We also looked for duplicate tweets posted for promotion. After applying these additional criteria, our final data comprised 86,652 spammers and 4,319,046 non-spammers, which was used for evaluating our model.

Spam Analysis

Characterizing the dominant spammer types is important, as it is the first step in understanding the dynamics of spamming. We studied the follower-following network creation strategies of the spammers. We found that there are two main types of spamming based on the follower-following strategies: (1) follow-flood spammers and (2) vigilant spammers. So, the question arises: why are some spammers more successful? In this section, we study the behavioral aspects of the tweet dynamics of spammers. We later leverage them in model building.

Spammer Type

To analyze the follower-following strategies, we calculated the number of followers (users that follow the current user) and the number of followings (users that the current user follows) for each spammer. Figure 1 plots these counts on a log scale. It shows that the follower and following counts differ for each spammer. Users with more followers than followings tend to be more successful, as they have been able to “earn” many users who follow them. So, we define the success rate as:

$sr_u = \frac{\text{number of followers of } u}{\text{number of followings of } u}$ (1)

Based on the network expansion success rate, we find that there are two dominant spamming strategies:

• Follow-flood spammers: These are less successful spammers who just flood the network with friendship initiations so as to get followers whom they can influence. We categorize the spammers with success rate ($sr_u$) less than 1 in this type.
• Vigilant spammers: These are successful spammers who take a cautious approach to friendship creation and content posting. Spammers with success rate ($sr_u$) greater than or equal to 1 are categorized as vigilant.

To learn the dynamics of each spammer type, we further analyzed the success rate of spammers together with other behavioral aspects: activity window, usage of promotion words or blacklist words, and hashtag mentioning.

Activity Window

We compute the activity window as the number of days a user is active in the Twitter network. Since we don’t have the exact time when a user was suspended, we approximate the time of suspension as the date of the user’s last tweet. We found that the average activity window of a vigilant spammer is 138 days with a standard deviation of 19 days, compared to an average of 35 days and a standard deviation of 12 days for follow-flood spammers. Although the basic strategy of any spammer is to inject itself into the network and emit spam content, the success rate also depends on how long the spammer can remain undetected in the network. So, vigilant spammers have a higher success rate.
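As a minimal sketch of the success rate and activity window described above, the following Python computes both for a toy account. The function names, the guard against a zero following count, the choice of the first tweet as the start of the window, and the sample numbers are assumptions made for illustration; they are not taken from the paper.

```python
from datetime import date


def success_rate(followers, followings):
    """Followers divided by followings.

    The max(1, ...) guard against division by zero is an assumption.
    """
    return followers / max(1, followings)


def spammer_type(followers, followings):
    """Success rate below 1 is follow-flood; otherwise vigilant."""
    if success_rate(followers, followings) < 1:
        return "follow-flood"
    return "vigilant"


def activity_window(tweet_dates):
    """Days between the first and last tweet.

    The last tweet approximates the suspension date, as in the text;
    using the first tweet as the start of activity is an assumption.
    """
    return (max(tweet_dates) - min(tweet_dates)).days


# Toy account: follows 200 users but has only 50 followers.
kind = spammer_type(followers=50, followings=200)
days = activity_window([date(2009, 6, 1), date(2009, 6, 20), date(2009, 7, 1)])
```

Here the toy account is labeled follow-flood, since its success rate is 0.25.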

Fraudulence

One of the primary reasons to spam is to constantly inject fraudulent information. So, we analyzed the fraudulence behavior of the two types of spammers. We labeled the tweets containing promotional words, adult words or blacklisted URLs as fraud tweets. So, we compute fraudulence as:

$fr_u = \frac{\text{number of fraud tweets of } u}{\text{total number of tweets of } u}$ (2)

We found that the average fraudulence of vigilant spammers is 0.34, compared to 0.86 for follow-flood spammers. So, the follow-flood spammers are more involved in spam.
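The fraudulence measure can be sketched in Python as the fraction of a user's tweets that are labeled fraud. The whitespace tokenization, the word list and the blacklisted URL below are hypothetical stand-ins; the paper does not specify its matching procedure or lists at this level of detail.

```python
def is_fraud(tweet, flagged_words, blacklisted_urls):
    """A tweet is fraud if any token is a flagged word or a blacklisted URL."""
    tokens = tweet.lower().split()
    return any(tok in flagged_words or tok in blacklisted_urls for tok in tokens)


def fraudulence(tweets, flagged_words, blacklisted_urls):
    """Fraction of a user's tweets labeled as fraud (0.0 for a user with no tweets)."""
    if not tweets:
        return 0.0
    fraud = sum(1 for t in tweets if is_fraud(t, flagged_words, blacklisted_urls))
    return fraud / len(tweets)


# Hypothetical lists and tweets, for illustration only.
FLAGGED = {"free", "winbig"}
BLACKLIST = {"http://bad.example/x"}
TWEETS = [
    "Get FREE followers now",
    "lunch was great",
    "click http://bad.example/x",
    "hello world",
]
score = fraudulence(TWEETS, FLAGGED, BLACKLIST)
```

Two of the four toy tweets match, so `score` is 0.5.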

Mentioning Celebrities and Popular Hashtags

Mentioning popular celebrities or hashtags empowers a tweet. So, one of the common strategies of spammers is to include the popular ones in their tweets. We studied this mentioning phenomenon and found that vigilant spammers mention half as many celebrities per tweet as follow-flood spammers.

Figure 1. Follower-following counts of spammers. Each blue dot represents a spammer, and the red line is the line y = x.

Learning Latent Features for Spam Detection

Having characterized the dynamics of spamming in Twitter, can we improve spam detection beyond the existing state-of-the-art approaches? To answer this, we used our Twitter data to set up a latent feature learning problem in networks. Our analysis is general and can be applied to any social network.

Overview

As discussed in the previous section, the dynamics of Twitter are interesting and can be leveraged to catch the spammers. So, we use the spam dynamics to formulate latent feature learning in social networks. Let $G = (V, E, X)$ be a given network with vertices $V$, edges $E$ and the social network data $X$ of the users in the social network. We aim to learn a mapping function $f: V \to \mathbb{R}^d$ from nodes to d-dimensional feature representations which can be used for prediction. The parameter $d$ specifies the number of dimensions of the latent features, so that the size of $f$ is $|V| \times d$. We present a novel sampling strategy that samples nodes in the network by exploiting the spam dynamics, such that the equivalent neighborhood $EN(u) \subset V$ contains the nodes whose tweeting behaviors are similar to those of node $u$. We generate $EN(u)$ for each node in the network and predict which nodes are members of $u$’s equivalent neighborhood based on the learnt latent features $f$. The basic rationale is that we wish to learn latent feature representations for nodes that respect equivalent neighborhoods (which are based on the spamming dynamics), so that classification and ranking using the learned representation yield results that leverage the spamming dynamics.

The Optimization Problem

As our goal is to learn the latent features $f$ that best describe the equivalent neighborhood $EN(u)$ of node $u$, we define the optimization problem as follows:

$$\max_f \sum_{u \in V} \log \Pr(EN(u) \mid f(u)) \qquad (3)$$

To solve the optimization problem, we extend the SkipGram architecture [5, 13, 15], which approximates the conditional probability using an independence assumption: the likelihood of observing an equivalent neighborhood node is independent of observing any other equivalent neighborhood node, given the latent features of the source node.

$$\Pr(EN(u) \mid f(u)) = \prod_{v \in EN(u)} \Pr(v \mid f(u)) \qquad (4)$$

Since the source node and the equivalent neighborhood node have symmetric equivalence, the conditional likelihood can be modeled as a softmax unit parameterized by a dot product of their features.

$$\Pr(v \mid f(u)) = \frac{\exp(f(v) \cdot f(u))}{\sum_{t \in V} \exp(f(t) \cdot f(u))} \qquad (5)$$

The optimization problem now becomes:

$$\max_f \sum_{u \in V} \Big[ -\log Z_u + \sum_{t \in EN(u)} f(t) \cdot f(u) \Big] \qquad (6)$$

For large networks, the partition function $Z_u = \sum_{t \in V} \exp(f(t) \cdot f(u))$ is expensive to compute. So, we use negative sampling [12] to approximate it. We use stochastic gradient descent over the model parameters defining the features $f$.

Feature learning methods based on the Skip-gram architecture were developed for natural language [11]. Since natural language texts are linear, the notion of a neighborhood can be naturally defined using a sliding window over consecutive words in sentences. Networks are not linear, and thus a richer notion of a neighborhood is needed. To mitigate this problem, we use multiple biased random walks, each one in principle exploring a different neighborhood [5].

Equivalent Neighborhood Generation

The analysis of spam dynamics leads to an important inference: nodes are similar if they have similar spam dynamics. So, we want to exploit those dynamics to generate the equivalent neighborhood $EN(u)$ for the node $u$. Nodes in a network are equivalent if they share similar behaviors. We use a random walk procedure which can be biased to generate the equivalent neighborhood. We bias the random walks based on four dynamics: common time of activity ($ct_{tv}$), success rate difference ($sr_{tv}$), fraudulence commonalities ($fr_{tv}$) and common mentioning in tweets ($me_{tv}$). We calculate each dynamic as follows:

$$ct_{tv} = \ldots \qquad (7)$$

$$sr_{tv} = 1 - \left|\,\ldots\,\right| \,\ldots\, \max(1, \ldots) \qquad (8)$$

$$fr_{tv} = 1 - \left|\,\ldots\,\right| \qquad (9)$$

$$me_{tv} = \frac{\text{common mentions between } t \text{ and } v}{\text{total mentions of } t \text{ and } v} \qquad (10)$$

For all four features above, a higher value represents a closer connection between the pair of nodes. For a source node $u$, we generate a random walk of fixed length $k$. The $i$-th node $c_i$ of a random walk starting at $c_0 = u$ is generated with the distribution:

$$P(c_i = t \mid c_{i-1} = v) = \begin{cases} \mathcal{B}_{vt}, & \text{if } (v, t) \in E \\ 0, & \text{otherwise} \end{cases} \qquad (11)$$

where $\mathcal{B}_{vt}$ is the normalized transition probability between nodes $v$ and $t$. The transition probabilities are computed based on the spam dynamics, so that the source node has spam dynamics equivalent to those of its neighborhood nodes.

We define four parameters which guide the random walk. Consider a random walk that has just traversed edge $(t, v)$ and now resides at node $v$. The walk now needs to decide on the next step, so it evaluates the transition probabilities on the edges $(v, x)$ leading from $v$. We set the transition probability to $\mathcal{B}_{vx} = \alpha_{pqrs}(t, v, x) \cdot w_{vx}$, where

$$\alpha_{pqrs}(t, v, x) = p \cdot (ct_{tv} + ct_{vx}) + q \cdot (sr_{tv} + sr_{vx}) + r \cdot (fr_{tv} + fr_{vx}) + s \cdot (me_{tv} + me_{vx}) \qquad (12)$$

and the parameters $p, q, r, s$ are used to prioritize the tweet dynamics. To select the next node, the random walk is biased towards the nodes which have tweet dynamics similar to both the current node and the previous node in the random walk.
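The biased step in equation 12 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the pairwise dynamics are taken as precomputed dictionaries keyed by node pairs, missing pairs default to 0, missing edge weights default to 1, the first step treats the start node as its own predecessor, and `length` counts the nodes in the walk. All function names are hypothetical.

```python
import random


def pair(table, a, b):
    """Symmetric lookup of a pairwise score; missing pairs default to 0 (assumption)."""
    return table.get((a, b), table.get((b, a), 0.0))


def alpha(dynamics, weights, t, v, x):
    """Bias term of equation 12: weighted sums over the four pairwise dynamics."""
    return sum(w * (pair(tab, t, v) + pair(tab, v, x))
               for w, tab in zip(weights, dynamics))


def transition_probs(graph, edge_weight, dynamics, weights, t, v):
    """Normalized probabilities of moving from v to each neighbor x."""
    scores = {x: alpha(dynamics, weights, t, v, x) * edge_weight.get((v, x), 1.0)
              for x in graph[v]}
    total = sum(scores.values())
    if total <= 0:  # no usable signal: fall back to a uniform step (assumption)
        return {x: 1.0 / len(scores) for x in scores}
    return {x: sc / total for x, sc in scores.items()}


def en_walk(graph, edge_weight, dynamics, weights, start, length, rng):
    """Generate one biased walk containing at most `length` nodes."""
    walk = [start]
    while len(walk) < length:
        v = walk[-1]
        if not graph[v]:
            break  # dead end
        t = walk[-2] if len(walk) > 1 else v  # no predecessor on the first step
        probs = transition_probs(graph, edge_weight, dynamics, weights, t, v)
        nodes = list(probs)
        walk.append(rng.choices(nodes, weights=[probs[n] for n in nodes])[0])
    return walk


# Toy graph: a is linked to b and c; only the pair (a, b) shares activity time.
GRAPH = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}
DYNAMICS = ({("a", "b"): 1.0}, {}, {}, {})  # (CT, SR, FR, ME)
WEIGHTS = (1.0, 1.0, 1.0, 1.0)              # (p, q, r, s)
toy_walk = en_walk(GRAPH, {}, DYNAMICS, WEIGHTS, "a", 3, random.Random(0))
```

In the toy graph the walk always moves from a to b, because c shares no dynamics with a and so receives zero transition probability.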

Algorithm: ENWalk

Algorithm 1 details our entire scheme. We simulate $\lambda$ random walks of fixed length $l$ starting from each node. To obtain each walk, we use GetEquivalentNeighbor, the random sampler that samples the next node based on the transition probabilities computed in equation 12. It is worth noting that the tweet dynamics between the nodes ($CT$, $SR$, $FR$, $ME$), defined in equations 7, 8, 9 and 10 respectively, can be pre-computed. Once we have the random walks, we obtain the $d$-dimensional numeric features using the optimization function in equation 6 with a window size of $k$. The three phases (preprocessing, random sampling and optimization) are asynchronous, so ENWalk is scalable.

Algorithm 1: ENWalk(G, d, λ, l, k, [p, q, r, s])
Input: graph G(V, E, W, X), embedding dimensions d, walks per node λ, walk length l, context size k, tweet parameters p, q, r, s
Output: matrix of latent features F
    (CT, SR, FR, ME) = Preprocess(G, p, q, r, s)
    Initialize walks to empty
    for i = 1 to λ do
        for each v_i ∈ V do
            Initialize walk to v_i
            for j = 1 to l do
                x = GetEquivalentNeighbor(G, CT, SR, FR, ME, walk[j], W)
                Append x to walk
            Append walk to walks
    F = StochasticGradientDescent(k, d, walks)

Experiment

We applied ENWalk to the Twitter dataset to evaluate its effectiveness. In this section, we discuss the baseline methods and compare them with ENWalk for classification and ranking.

Baseline Methods

For classification, we compare our model with two graph embedding methods: DeepWalk and node2vec. We use PageRank and Markov Random Field (MRF) approaches for ranking of spam nodes. We did not use feature extraction techniques like [1], as they only use the node features without using the graph structure.

DeepWalk [13]. It is the first approach to integrate language modeling for network feature representation. It generates uniform random walks, which play the role of sentences in the language model.

Node2vec [5]. It is another representation learning approach for nodes in a network. It extends the language model of random walks by employing a flexible notion of neighborhood. It designs a biased random walk using BFS and DFS neighborhood discovery.

PageRank Models.

PageRank is a popular ranking algorithm that exploits the link-based structure of a network graph to rank the nodes of the graph.

$PR = (1 - \alpha) \cdot M \cdot PR + \alpha \cdot p$ (13)

where $M$ is the transition probability matrix, $p$ represents the prior probability with which a random surfer surfs to a random page, and $\alpha$ is the damping factor. For variations of PageRank, we vary the values of $M$ and $p$ using the trustworthiness of a user. Trustworthiness ($f_{Trust}$) is computed using a set of features […]. […] the $f_{Trust}$ score of 800 users (400 non-suspended and 400 suspended). We gave each a real-valued trustworthiness score between 0 and 1. A value closer to 0 means the user is most likely a spammer. We then obtain the weights of the features by learning a linear regression model on these users.

• Traditional PageRank: We use the default PageRank settings for $M$ and $p$.
• Trust Induced and Trust Prior: The transition matrix $M$ is modified as $M_{uv} = M_{uv} \cdot f_{Trust}(v), \ \forall u, \forall v$, and $f_{Trust}(v)$ is used as the prior probability.
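The two PageRank variants can be sketched with a small power iteration in Python. This is a sketch under stated assumptions rather than the paper's implementation: rank is taken to flow along out-links, the trust-scaled transition weights are renormalized per node, the prior is normalized to sum to 1, dangling nodes redistribute their mass according to the prior, and the value of `alpha` and all names are hypothetical.

```python
def pagerank(out_links, prior, alpha=0.15, trust=None, iters=50):
    """Iterate PR = (1 - alpha) * M * PR + alpha * p on a small graph."""
    nodes = list(out_links)
    # Transition weights; with trust, each link u -> v is scaled by trust[v]
    # and the weights leaving u are renormalized (renormalization is an assumption).
    M = {}
    for u in nodes:
        w = {v: (trust[v] if trust else 1.0) for v in out_links[u]}
        z = sum(w.values())
        M[u] = {v: x / z for v, x in w.items()} if z > 0 else {}
    psum = sum(prior[u] for u in nodes)
    p = {u: prior[u] / psum for u in nodes}
    pr = dict(p)
    for _ in range(iters):
        nxt = {u: alpha * p[u] for u in nodes}
        for u in nodes:
            if M[u]:
                for v, m in M[u].items():
                    nxt[v] += (1 - alpha) * m * pr[u]
            else:
                # Dangling node: spread its mass according to the prior (assumption).
                for v in nodes:
                    nxt[v] += (1 - alpha) * pr[u] * p[v]
        pr = nxt
    return pr


# Toy graph and hypothetical trust scores; b is the least trusted user.
LINKS = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
UNIFORM = {"a": 1.0, "b": 1.0, "c": 1.0}
TRUST = {"a": 0.9, "b": 0.1, "c": 0.9}

plain = pagerank(LINKS, UNIFORM)                # traditional PageRank
trusted = pagerank(LINKS, TRUST, trust=TRUST)   # trust induced and trust prior
```

With these toy values, the low-trust node b receives a smaller score under the trust variant than under the traditional one.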

Markov Random Field Models.

Markov Random Fields are undirected graphs (which can be cyclic) that satisfy the three conditional independence properties (pairwise, local and global). For inference, we use the Loopy Belief Propagation algorithm. Inspired by the spam detection approach in [2], we define 3 hidden states {Spammer, Mixed, Non-Spammer} and use the propagation matrix given in Table 1. Logically, spammers follow other spammers more (hence the 0.8 probability) and non-spammers tend to follow other non-spammers. We also include the mixed state to cover those users who are difficult to categorize as spammers or non-spammers.
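A minimal sketch of loopy belief propagation over the three states is given below in Python. The propagation values are those of Table 1, but the orientation of the matrix (row = state of the sending node, column = state of the receiving node), the synchronous update schedule, the fixed iteration count and the toy priors are all assumptions made for illustration.

```python
from math import prod

STATES = ("S", "M", "N")
# PSI[i][j]: compatibility of a sender in state i with a receiver in state j.
# Values from Table 1; the row/column orientation is an assumption.
PSI = [
    [0.80, 0.40, 0.025],
    [0.15, 0.50, 0.125],
    [0.05, 0.10, 0.850],
]


def loopy_bp(edges, priors, iters=20):
    """Synchronous sum-product belief propagation on an undirected graph."""
    nbrs = {u: set() for u in priors}
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    # One message per directed pair (u, v), initialized to uniform.
    msgs = {(u, v): [1.0 / 3] * 3 for u in nbrs for v in nbrs[u]}
    for _ in range(iters):
        new = {}
        for (u, v) in msgs:
            out = []
            for j in range(3):          # state of the receiver v
                total = 0.0
                for i in range(3):      # state of the sender u
                    incoming = prod(msgs[(w, u)][i] for w in nbrs[u] if w != v)
                    total += priors[u][i] * PSI[i][j] * incoming
                out.append(total)
            z = sum(out)
            new[(u, v)] = [x / z for x in out]
        msgs = new
    beliefs = {}
    for u in nbrs:
        b = [priors[u][i] * prod(msgs[(w, u)][i] for w in nbrs[u]) for i in range(3)]
        z = sum(b)
        beliefs[u] = [x / z for x in b]
    return beliefs


# Toy graph: an unlabeled user c connected to two likely spammers a and b.
PRIORS = {
    "a": [0.90, 0.05, 0.05],
    "b": [0.90, 0.05, 0.05],
    "c": [1.0 / 3, 1.0 / 3, 1.0 / 3],
}
beliefs = loopy_bp([("a", "c"), ("b", "c")], PRIORS)
```

In this toy graph, the unlabeled user c ends up with its highest belief in the Spammer state, reflecting the influence of its two neighbors.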

Node Classification

Table 1. Propagation Matrix for (S)pammer, (M)ixed, (N)on-Spammer

     S      M      N
S    0.80   0.40   0.025
M    0.15   0.50   0.125
N    0.05   0.10   0.850

We obtained the feature representations from three different algorithms: ENWalk, node2vec and DeepWalk, using the settings used in node2vec and DeepWalk. All the feature learning is unsupervised. Similar to node2vec and DeepWalk, we used $d = 128$, $\lambda = 10$, $l = 80$, $k = 10$. We found that ENWalk is sensitive to the parameters $d, \lambda, l, k$ in a similar way to node2vec and DeepWalk. We used each feature representation as an example for a standard SVM classifier. We used 10-fold cross-validation on balanced data obtained by sub-sampling the negative class. From the classification results in Table 2, ENWalk performs better. It has higher precision, recall, F1-score and accuracy due to the biased random walks.
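The evaluation protocol above can be sketched in Python as a balanced sub-sampling step followed by the four reported metrics. The classifier itself is left out; the helper names, the 1 = spammer / 0 = non-spammer label encoding and the toy labels are assumptions made for illustration, and the sketch assumes at least as many negatives as positives.

```python
import random


def balanced_subsample(labels, rng):
    """Keep every positive and an equal-sized random sample of negatives."""
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    return sorted(pos + rng.sample(neg, len(pos)))


def classification_metrics(y_true, y_pred):
    """Return (precision, recall, F1, accuracy) for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(y_true)
    return precision, recall, f1, accuracy


# Toy labels: two spammers among six users.
LABELS = [1, 0, 0, 0, 1, 0]
kept = balanced_subsample(LABELS, random.Random(0))
metrics = classification_metrics([1, 1, 0, 0], [1, 0, 1, 0])
```

In a full pipeline, the indices in `kept` would be split into 10 folds, with a classifier trained on the learned embeddings in each fold.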

Node Ranking

We use two metrics to evaluate the ranking results: the Cumulative Distribution Function (CDF) of Suspended Users and Precision@n. We rank all the nodes in the graph and assign each node a rank percentile. For each rank percentile, we compute the number of suspended users in that percentile. We plot the cumulative distribution function for those suspended users. We also calculate the Area Under Curve (AUC) for the CDF. The higher the area, the better the model. Precision@n of Suspended Users evaluates how many of the top n nodes suggested by a model are actually suspended users. This is effective for screening the nodes that are probably spammers.

To evaluate the ranking performance of ENWalk, we use Logistic Regression on the features obtained from the model. We compare our model with PageRank and Markov Random Field models. We present the CDF in Figure 2. We can see that ENWalk outperforms all the baseline models. We also computed the AUC and precision@100 (Table 3). A higher AUC and precision@100 signify the ability to profile the top spammers.

Conclusion

We studied the problem of identifying spammers in Twitter who are involved in malicious attacks. This is important, as it has many practical applications in today’s world, where almost everyone is actively social online. This paper proposed a method of spam detection in Twitter that makes use of the online network structure and the information shared. This data-driven approach is important, as there is a lot of social media data online these days. We demonstrated the helpfulness of biased random walks in learning node embeddings that can be used for classification and ranking tasks.

Acknowledgements: This work is supported in part by NSF 1527364. We also thank the anonymous reviewers for their helpful feedback.

Figure 2. Cumulative Distribution Function of Suspended Nodes

Table 3. Ranking Results: Area Under CDF Curve (AUC) and Precision@100 (P@100)

Model      AUC      P@100
PR-T       0.4059   0.02
PR-TITP    0.4181   0.03
MRF        0.4944   0.02
DeepWalk   0.5502   0.05
node2vec   0.5836   0.05
ENWalk     0.6335   0.12
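The two ranking metrics reported in Table 3 can be sketched in Python as follows. This is one plausible reading, not the paper's exact procedure: higher scores are assumed to mean "more likely spammer", every suspended user is assumed to appear in the score dictionary, and the area under the CDF is approximated with the trapezoid rule over rank positions. The names and toy scores are hypothetical.

```python
def precision_at_n(scores, suspended, n):
    """Fraction of the top-n ranked nodes that are suspended users."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return sum(1 for u in ranked[:n] if u in suspended) / n


def cdf_auc(scores, suspended):
    """Area under the cumulative fraction of suspended users found,
    as a function of rank percentile (trapezoid rule)."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    step = 1.0 / len(ranked)
    total = len(suspended)
    found, prev, area = 0, 0.0, 0.0
    for u in ranked:
        if u in suspended:
            found += 1
        cur = found / total
        area += (prev + cur) / 2 * step
        prev = cur
    return area


# Toy scores: the two suspended users are ranked first.
SCORES = {"a": 0.9, "b": 0.8, "c": 0.2, "d": 0.1}
SUSPENDED = {"a", "b"}
p_at_2 = precision_at_n(SCORES, SUSPENDED, 2)
auc = cdf_auc(SCORES, SUSPENDED)
```

A model that ranks suspended users early pushes the CDF up quickly and therefore yields a larger area.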

Table 2. Classification Results: Precision (P), Recall (R), F1-score (F) and Accuracy (A)

Model      P      R      F      A
DeepWalk   0.44   0.49   0.46   0.51
Node2vec   0.46   0.53   0.49   0.57
ENWalk     0.59   0.66   0.62   0.71

References

[1] Benevenuto, F., Magno, G., Rodrigues, T. and Almeida, V. 2010. Detecting spammers on twitter. Collaboration, electronic messaging, anti-abuse and spam conference (CEAS). 6, (2010), 12.
[2] Fei, G., Mukherjee, A., Liu, B., Hsu, M., Castellanos, M. and Ghosh, R. 2013. Exploiting Burstiness in Reviews for Review Spammer Detection. Proceedings of the Seventh International Conference on Weblogs and Social Media, ICWSM 2013, Cambridge, Massachusetts, USA, July 8-11, 2013. (2013).
[3] Gao, H., Hu, J., Wilson, C., Li, Z., Chen, Y. and Zhao, B.Y. 2010. Detecting and characterizing social spam campaigns. Proceedings of the 10th ACM SIGCOMM conference on Internet measurement. (2010), 35–47.
[4] Ghosh, S., Viswanath, B., Kooti, F., Sharma, N.K., Korlam, G., Benevenuto, F., Ganguly, N. and Gummadi, K.P. 2012. Understanding and combating link farming in the twitter social network. Proceedings of the 21st … . (2012), 61–70.
[5] Grover, A. and Leskovec, J. 2016. node2vec: Scalable feature learning for networks. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2016), 855–864.
[6] Hu, X., Tang, J., Zhang, Y. and Liu, H. 2013. Social Spammer Detection in Microblogging. IJCAI (2013), 2633–2639.
[7] K C, S. and Mukherjee, A. 2016. On the Temporal Dynamics of Opinion Spamming: Case Studies on Yelp. (2016).
[8] Kwak, H., Lee, C., Park, H. and Moon, S. 2010. What is Twitter, a Social Network or a News Media? The International World Wide Web Conference Committee (IW3C2). (2010), 1–10.
[9] Lee, K., Caverlee, J. and Webb, S. 2010. Uncovering social spammers: social honeypots + machine learning. Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval (2010), 435–442.
[10] Li, H., Mukherjee, A., Liu, B., Kornfield, R. and Emery, S. 2014. Detecting Campaign Promoters on Twitter Using Markov Random Fields. (2014), 290–299.
[11] Mikolov, T., Chen, K., Corrado, G. and Dean, J. 2013. Distributed Representations of Words and Phrases and their Compositionality. NIPS. (2013), 1–9.
[12] Mikolov, T., Corrado, G., Chen, K. and Dean, J. 2013. Efficient Estimation of Word Representations in Vector Space. Proceedings of the International Conference on Learning Representations (ICLR 2013). (2013), 1–12.
[13] Perozzi, B., Al-Rfou, R. and Skiena, S. 2014. Deepwalk: Online learning of social representations. Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining (2014), 701–710.
[14] Stringhini, G., Kruegel, C. and Vigna, G. 2010. Detecting spammers on social networks. Proceedings of the 26th annual computer security applications conference (2010), 1–9.
[15] Tang, J., Qu, M., Wang, M., Zhang, M., Yan, J. and Mei, Q. 2015. Line: Large-scale information network embedding. Proceedings of the 24th International Conference on World Wide Web (2015), 1067–1077.
[16] Thomas, K., Grier, C., Song, D. and Paxson, V. 2011. Suspended accounts in retrospect: an analysis of twitter spam. Proceedings of the 2011 ACM … . (2011), 243–258.
[17] Weng, J., Lim, E.P., Jiang, J. and He, Q. 2010. Twitterrank: Finding topic-sensitive influential twitterers. Proceedings of the 3rd ACM International Conference on Web Search and Data Mining (WSDM 2010). (2010), 261–270.
[18] Yang, J. and Leskovec, J. 2011. Patterns of temporal variation in online media. WSDM (2011), 177.
[19] Zhang, X., Zhu, S. and Liang, W. 2012. Detecting spam and promoting campaigns in the Twitter social network.