Multiple Accounts Detection on Facebook Using Semi-Supervised Learning on Graphs
Xiaoyun Wang, Chun-Ming Lai, Yunfeng Hong, Cho-Jui Hsieh, S. Felix Wu
University of California, Davis
Email: {xiywang, cmlai, yfhong, chohsieh, sfwu}@ucdavis.edu

Abstract
In social networks, a single user may create multiple accounts to spread his or her opinions and to influence others, by actively commenting on different news pages. It would be beneficial to both social networks and their communities to demote such abnormal activities, and the first step is to detect those accounts. However, the detection is challenging, because these accounts may have very realistic names and reasonable activity patterns. In this paper, we investigate three different approaches, and propose using graph embedding together with semi-supervised learning to predict whether a pair of accounts are created by the same user. We carry out extensive experimental analyses to understand how changes in the input data and in the algorithmic parameters and optimizations affect the prediction performance. We also discover that local information has higher importance than global information for such prediction, and point out the threshold leading to the best results. We test the proposed approach with 6,700 Facebook pages from the Middle East, and achieve an averaged accuracy of 0.996 and AUC (area under curve) of 0.952 for users with the same name; with the U.S. 2016 election dataset, we obtain the best AUC of 0.877 for users with different names.
Introduction
In the past decade, the number of people using online social networks (OSNs) as their sources of news and information has been growing rapidly. They not only receive information, but also share opinions. The cost of creating new accounts on OSNs is low, leading to the single-user multiple-accounts issue. People create multiple accounts for various reasons, and we focus on those using multiple accounts to comment on news pages. In that way, they build a false impression that their opinions are popular, in an attempt to influence others in the online communities (King, Pan, and Roberts 2017).

The multiple-accounts issue can also be raised by OSNs themselves. For instance, Facebook randomizes the account IDs seen by different crawler instances, as an anti-crawling feature. We consider such OSN-introduced multiple accounts special cases of the multiple-accounts issue.
Multiple accounts are noise in datasets, and introduce inaccuracy into analysis results, especially in user behavior studies. Detecting multiple accounts cleans the data, and helps to improve the quality of the overall data analyses. There are several existing approaches to multiple-accounts detection, but they face some challenges:

• Large portion of ground truth: previous works using supervised learning usually require a large portion of samples with ground truth, but getting ground truth is usually expensive. We can only query several hundred accounts per day for the ground truth.

• Scalability: rich information, such as time stamps, IP addresses, and textual contents, may yield good detection results, but processing large amounts of information imposes scalability issues.

• Exact number of users: GCN (graph convolutional network) (Kipf and Welling 2016), previous semi-supervised works and clustering approaches all need the exact number of users beforehand, which is usually unknown in practice.

In this paper, we investigate three methods to predict whether a pair of accounts belong to the same user:

• unsupervised learning using Katz similarity;

• semi-supervised learning using Katz similarity;

• semi-supervised learning using graph embedding.

These methods use only a limited portion of ground truth, and who-comments-on-which-page information in the form of graphs, without knowing the exact number of users. To address the scalability issue, we also develop a clustering-based approach to reduce the search space, and use alternative ground truth to further reduce the number of ground truth queries.

We evaluate our methods with pages crawled from Facebook. With two small-scale datasets, we compare the above three methods. We then extend to 100 datasets, each consisting of accounts from the Middle East that share the same display name.
We also test the methods using news pages related to the 2016 United States election, with user activities randomly distributed into multiple split accounts, to simulate the effects of multiple accounts created by the same user. The obtained detection performance is reasonably good. We make the following contributions:

• a novel semi-supervised method to detect multiple accounts in online social networks, using graph embedding to measure distances between nodes in graphs;

• evaluations on large real-world datasets, with both OSN-introduced and user-introduced multiple accounts, showing state-of-the-art performance;

• experimental analyses to understand how the input data and the algorithmic parameters and optimizations affect the prediction performance.

Related Work
Detecting multiple accounts in OSNs has been explored with user behavior analyses and graph theory in recent years. Machine learning and artificial intelligence have also been used for labeling nodes on social networks, such as labeling Wikipedia article categories (Tsikerdekis and Zeadally 2014).

User behavior analysis is an important tool for multiple-accounts detection. Tsikerdekis and Zeadally (2014) used nonverbal behaviors, such as time-dependent discussion of users, articles, and article discussions, for multiple-account identity deception detection. Gurajala et al. (2016) utilized profile characteristics to detect fake Twitter accounts, based on account features such as creation times, update times, and the numbers of friends and followers. Singh et al. (2016) clustered users by their textual behaviors. Sakakura et al. (2012) measured similarity using bookmarks for spam detection. User behavior analysis is based on profiling the basic features of users, and it inherently lacks a deeper understanding of the interactions between users on OSNs. Moreover, it is possible to reverse engineer these methods, and generate statistical features that are similar to those of normal accounts.

Graph-based approaches can reveal the user-user, user-article, and user-page relationships on social networks. Jiang et al. (2013) analyzed profile visit histories as well as links between passive profiles and active comments, and built latent interaction graphs to understand the behavior of users, as well as to detect fake accounts on OSNs. Conti et al. (2012) detected fake accounts using graph structures and longitudinal information. Wang et al. (2016) used clickstream and similarity graphs to cluster users, and then studied user behaviors. All these methods are based on graph clustering, which requires knowing how many users exist in the dataset; however, we do not know how many distinct users are in our datasets. They also cannot tell whether two accounts belong to the same user.
Only basic graph features, such as average degree and in/out edges, are used, but no deeper understanding of the graph is utilized.

Machine learning models have also been applied in social network studies. Xiao et al. (2015) developed a supervised learning approach to detect fake accounts registered by the same user, using IP addresses and registration dates provided by the LinkedIn dataset. Logistic regression is widely used in prediction and classification. Zheng et al. (2015) used a support vector machine (SVM) to detect spammers on social networks. Naive Bayes, decision trees, and random forests are also popular approaches in detecting fake and malicious users on social networks (Fire, Katz, and Elovici 2012; Lai et al. 2017; Boshmaf et al. 2015). Gong et al. (2014) applied semi-supervised learning to sybil detection, by labeling sampled nodes as benign or sybil, then classifying the nodes using information in the directed messages, together with the known labels of the nodes. The supervised learning methods require a large amount of ground truth to limit performance variance, and the semi-supervised methods for sybil detection require labels for sampled nodes. However, for our application, getting the ground truth is time consuming, and the nodes have no labels. This makes those approaches inapplicable.

Graph embedding is an approach that quantifies nodes into vectors. It is widely used in link prediction, node classification, multi-label learning and clustering. Different graph embedding methods capture different graph features. Node2Vec, proposed by Grover and Leskovec (2016), is a method based on random walks; it provides a trade-off between global and local information. Choosing the right balance makes Node2Vec preserve community structures, as well as structural equivalences between nodes. DeepWalk (Perozzi, Al-Rfou, and Skiena 2014), introduced before Node2Vec, can be considered a special case of Node2Vec with p and q (parameters in Node2Vec) equal to 1.
SDNE (Wang, Cui, and Zhu 2016) and DNGR (Cao, Lu, and Xu 2016) are auto-encoders which capture non-linearity in graphs. Cao et al. (2015) used a k-step probability matrix and matrix factorization to obtain global structure representations of graphs. All the deep learning methods above perform node classification or multi-labeling, where the number of distinct classes or labels is known. However, in most real-world applications, such a number is rarely available.

Data Description
The data we use were crawled from Facebook using the Social Interactive Networking and Conversation Entropy Engine (SINCERE) system (Erlandsson et al. 2015) over more than five years, and consist of news pages from Asia, North America, the Middle East and Europe. The open-sourced SINCERE system uses distributed crawler instances to increase the crawling rate. However, as mentioned in the Introduction section, Facebook randomizes the account numbers for different crawler instances, leading to the OSN-introduced multiple-accounts issue. These OSN-introduced cases are a little different from the user-introduced ones: the OSN-introduced multiple accounts of the same user have the same display name, while the user-introduced ones may not. There are 10,044,228,650 accounts and 23,579,873 pages in the database. But there are only two billion users on Facebook (Constine 2017), which means we are seeing about five accounts per user on average.

We focus our study on the news pages from the Middle East and the 2016 U.S. election. The Middle East data cover more than 6,700 pages, and 100 datasets. Each dataset contains all the accounts sharing the same display name, together with all commenting activities by these accounts on the news pages. The sizes of these datasets range from 9k to 30k accounts, and from 14k to 92k activities. We select data from the Middle East because the users there have a high tendency to share the same names, giving us good opportunities to test our methods. (Our datasets are available at anonymous_url.) The 2016 U.S. election dataset covers 34 news pages, with 6 million accounts and 26,985,976 commenting activities from these accounts. This dataset does not have OSN-introduced multi-accounts, because only one crawler instance was used to collect it.
Ground Truth
We wrote a separate crawler to query the unique user ID (the primary ID) of each account (identified by a scope ID) for the Middle East datasets. This crawler runs much slower than the SINCERE system, at only hundreds of accounts per day. Running the primary ID crawler for a large number of accounts is not practical, but it tells us the ground truth about which accounts are indeed OSN-introduced multiples.

There is no good and guaranteed way to know which accounts are user-introduced multiples. Instead, we randomly separate the activities of each account, which uniquely identifies a single user in this particular dataset, into different split accounts, to simulate user-introduced multiples. In this way, the ground truth is whether a pair of split accounts comes from the same original account.
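As a sketch of this splitting procedure (the account and activity identifiers are hypothetical), each account's activities can be distributed uniformly at random among s split accounts:

```python
import random

def split_account(account_id, activities, s, rng):
    """Randomly distribute one account's activities into s split accounts.

    Returns a dict mapping split-account ids (e.g. 'u1#0' .. 'u1#2')
    to their assigned activity lists. The ground truth is then: two
    split accounts match iff they share the same original account_id.
    """
    splits = {f"{account_id}#{k}": [] for k in range(s)}
    for act in activities:
        k = rng.randrange(s)
        splits[f"{account_id}#{k}"].append(act)
    return splits

rng = random.Random(42)
splits = split_account("u1", list(range(10)), 3, rng)
```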
Design
We introduce our approaches in this section: we begin with how to construct relationship graphs from the crawled datasets, followed by three methods for multiple-accounts detection, together with a scalability analysis and optimizations at the end of the section.
Graph Construction
We construct a graph G(V, E) for each dataset, and carry out multiple-accounts detection on the graph. The nodes V represent the accounts and the news pages, and the edges E represent relationships between accounts and pages: an edge e exists when an account comments on or likes a news page. The constructed graphs are bipartite, because edges only exist between accounts and pages, not within the accounts or pages themselves. We use undirected bipartite graphs, either weighted or unweighted, in our experiments. However, our approaches do not depend on these graph properties, and should extend to more general graph types.

Figure 1 gives an example based on a simple dataset, with the constructed graph on the left and the detection results on the right. A little bit of clustering can be seen from this example already, and we make use of the clustering results in our methods.

Unsupervised Learning using Katz Similarity
Unsupervised learning has the advantage of not requiring any ground truth, which is difficult to obtain. The main step of this method is to calculate the similarities (i.e., distances) between different accounts. There are various metrics to measure the similarities, such as common neighbors, common edges, and node-edge scores. We choose Katz similarity (Katz 1953) here. It is a commonly used topological measurement in social network studies, and Esfandar et al. (2010) used it for link prediction; thus we use it to predict whether a pair of accounts belongs to the same user. Katz similarity can be computed by:

S = (I − βM)^(−1) − I,   (1)

where M is the adjacency matrix representation of G, β is a scalar smaller than 1/‖M‖ to ensure convergence, and I is the identity matrix.

Figure 1: An example of a constructed graph. Left: the constructed graph, with accounts colored in white and pages colored in red; right: the detection results, with multiple accounts from the same user colored by the same color, except for black (pages) and white (users with only a single account each).
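Equation (1) and the percentile thresholding of Algorithm 1 can be sketched with numpy (the adjacency matrix below is a toy example, not our data):

```python
import numpy as np

def katz_similarity(M, beta=None):
    """Katz similarity S = (I - beta*M)^{-1} - I.

    beta must be smaller than 1/||M|| for the series to converge;
    here we default to half that bound.
    """
    n = M.shape[0]
    if beta is None:
        beta = 0.5 / np.linalg.norm(M, 2)
    return np.linalg.inv(np.eye(n) - beta * M) - np.eye(n)

def predict_same_user(S, account_idx, alpha=95):
    """Flag account pairs whose similarity exceeds the alpha-th percentile of S."""
    thresh = np.percentile(S, alpha)
    pairs = []
    for i in range(len(account_idx)):
        for j in range(i + 1, len(account_idx)):
            if S[account_idx[i], account_idx[j]] > thresh:
                pairs.append((account_idx[i], account_idx[j]))
    return pairs

# Toy bipartite graph: accounts 0 and 1 both comment on page 2.
M = np.array([[0, 0, 1], [0, 0, 1], [1, 1, 0]], dtype=float)
S = katz_similarity(M)
```

Accounts 0 and 1 share a page, so S[0, 1] is positive through the length-2 path between them.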
Algorithm 1: Unsupervised Method using Katz Similarity
  compute the Katz similarity matrix S of G;
  denote the account nodes of G as V_a;
  if the similarity S_{u,v} between two accounts u and v in V_a is larger than an empirical threshold percentile α of S then
    predict u and v to be multiple accounts of the same user;
  else
    predict u and v to be accounts of different users.
  end if

The unsupervised method using Katz similarity is listed in Algorithm 1. It computes the similarity matrix to measure how close two accounts are in the graph, and predicts that they belong to the same user if their similarity is larger than a threshold percentile α. α is a critical parameter in this method. It is selected empirically, and may need to change across datasets. If a little ground truth is given, α can be selected by cross validation.

Semi-Supervised Learning using Katz Similarity
In order to avoid selecting the value of α, we turn to semi-supervised learning using the label spreading model (Zhou et al. 2004). Unlike conventional label spreading methods, which work on the labels of nodes, we predict the labels of node pairs. More precisely, we predict the value L_{u,v} of the pair (u, v), which is defined as how likely accounts u and v belong to the same user; L_{u,v} equals 1 if u and v are indeed multiple accounts of the same user, 0 if they belong to different users, and −1 when unknown.
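This pair-labeling setup can be sketched with scikit-learn's implementation of the label spreading model (the feature values below are illustrative, not from our data): labeled pairs get 0/1, unknown pairs get −1, and the model fills in the unknowns.

```python
import numpy as np
from sklearn.semi_supervised import LabelSpreading

# One row per account pair (u, v). The single feature mimics
# max(S) - S[u,v] + eps, so small values suggest "same user".
# Labels: 1 = same user, 0 = different users, -1 = unknown.
X = np.array([[0.05], [0.10], [0.90], [0.85], [0.08], [0.88]])
L = np.array([1, 1, 0, 0, -1, -1])

model = LabelSpreading(kernel="rbf", gamma=20)
model.fit(X, L)
pred = model.transduction_  # labels for all pairs, unknowns filled in
```

The unknown pair with feature 0.08 sits close to the known same-user pairs, so it inherits label 1; the one at 0.88 inherits label 0.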
Algorithm 2: Semi-Supervised Method using Katz Similarity
  compute the Katz similarity matrix S of G;
  denote the account nodes of G as V_a;
  randomly sample 1/4 of the nodes in V_a to query the ground truth;
  assign each X_{u,v} with max(S) − S_{u,v} + ε;
  assign each L_{u,v} with: 1 if accounts u and v are known to belong to the same user, 0 if known to belong to different users, and −1 if unknown;
  train the label spreading model with the RBF kernel, using X and L as inputs;
  predict the elements of L which are initially unknown.

The semi-supervised method using Katz similarity is listed in Algorithm 2. In the label spreading model, we use the radial basis function (RBF) kernel K(x_i, x_j) = exp(−‖x_i − x_j‖² / σ²), where σ is the scaling parameter and x_i denotes the feature of a pair of nodes (u, v). We use Katz similarity as the feature, so x_i = S_{u,v} (with S defined in Equation (1)). We choose to sample 1/4 of the nodes, aiming at a 6.25% label rate on the input pairs.

While this method does avoid choosing the value of α, it yields poor prediction accuracy. We suspect that a single scalar value per pair, the Katz similarity, may not carry sufficient information for the label spreading model to make good predictions; thus, we try to feed more information into the model by using graph embedding.

Semi-Supervised Learning using Graph Embedding
Graph embedding gives each node a vector, and we use Node2Vec (Grover and Leskovec 2016) here. Node2Vec obtains the node embeddings from random walks, and these walks can be made more local or more global by controlling a pair of parameters p and q, where p represents the likelihood of immediately revisiting a node, and q controls whether the random walks lean in the breadth or depth direction. When q > 1, the random walks stay closer to the current nodes (i.e., walk locally); when q < 1, the random walks visit nodes far from the current ones (i.e., walk globally). As a result, by choosing these two parameters our algorithm can balance the global and local information of the graph. We will examine how these parameters affect the prediction accuracy in the upcoming Performance Analyses subsection. We also choose the size d of each embedding vector to be 128, which is the default setting in Node2Vec.

The semi-supervised method using graph embedding is listed in Algorithm 3. Compared to Algorithm 2 using Katz similarity, the only difference is in how the similarities between nodes are measured, and subsequently how the features of pairs are calculated. Because each node now carries a length-d vector, there are several operations (listed in Table 1) that can be used to measure the similarity between the embedding vectors W_u and W_v of two nodes u and v.
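To make the role of p and q concrete, here is a self-contained sketch of one step of a Node2Vec-style biased walk (an illustration of the transition rule, not the actual Node2Vec implementation; the toy graph is hypothetical):

```python
import random

def biased_step(adj, prev, curr, p, q, rng):
    """One step of a Node2Vec-style second-order random walk.

    Returning to `prev` is weighted 1/p; moving to a neighbor of `prev`
    (BFS-like, local) is weighted 1; moving farther away (DFS-like,
    global) is weighted 1/q. A larger q keeps the walk local.
    """
    neighbors = sorted(adj[curr])
    weights = []
    for x in neighbors:
        if x == prev:
            weights.append(1.0 / p)
        elif x in adj[prev]:
            weights.append(1.0)
        else:
            weights.append(1.0 / q)
    return rng.choices(neighbors, weights=weights, k=1)[0]

# Toy graph: triangle 0-1-2 plus a pendant node 3 reachable only via 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
rng = random.Random(0)
nxt = biased_step(adj, prev=1, curr=2, p=1, q=4, rng=rng)
```

With a large q, a walk sitting at node 2 (having come from 1) rarely jumps out to the distant node 3, which is exactly the "local" regime described above.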
Algorithm 3: Semi-Supervised Method using Graph Embedding
  use Node2Vec to give each node v of G an embedding vector W_v;
  denote the account nodes of G as V_a;
  randomly sample 1/4 of the nodes in V_a to query the ground truth;
  assign X_{u,v} with the weighted L-1 norm ‖W_u − W_v‖_1̄ (Table 1);
  assign L_{u,v} with: 1 if accounts u and v are known to belong to the same user, 0 if known to belong to different users, and −1 if unknown;
  train the label spreading model with the RBF kernel, using X and L as inputs;
  predict the elements of L which are initially unknown.

Operation          Definition
Average            [(W_u ⊕ W_v) / 2]_i = ([W_u]_i + [W_v]_i) / 2
Weighted L-1 norm  [‖W_u − W_v‖_1̄]_i = |[W_u]_i − [W_v]_i|
Weighted L-2 norm  [‖W_u − W_v‖_2̄]_i = |[W_u]_i − [W_v]_i|²
Cosine             cos(W_u, W_v) = (W_u · W_v) / (‖W_u‖ ‖W_v‖)

Table 1: Operations to measure similarities between two nodes with embedding vectors.

We try all the operations, and observe that the L-1 norm and cosine yield the best (and similar) accuracy, while the L-2 norm and average do not perform as well in our tests.

Graph embedding can extract more information from the graph than Katz similarity, and brings better prediction accuracy. However, these semi-supervised methods have drawbacks, mainly in terms of scalability. For datasets with thousands of accounts, getting ground truth on 1/4 of them can be very time consuming. Moreover, the label spreading model runs on pairs of accounts. The number of pairs is usually in the millions or more, since it grows quadratically with the number of accounts, and this results in high computational time and memory requirements. Thus, in the following subsection we discuss how to scale up the proposed algorithms.

Scaling up the proposed algorithms
We introduce a clustering-based approach to improve the scalability. The bottlenecks in our algorithm exist mainly at two places: 1) the computational time and memory requirement of label spreading, and 2) the time-consuming ground truth queries. Both bottlenecks are related to the number of possible account pairs. Let n be the number of accounts in V_a; there are then O(n²) pairs of accounts as inputs to the label spreading model, so both its running time and its memory requirement grow as O(n²). If we can know with high certainty that some accounts are not from the same user, both bottlenecks can be overcome. The key idea is clustering: group accounts that potentially belong to the same users together, and separate those that do not. We use spectral clustering (Yu and Shi 2003) to cluster V_a into sub-graphs based on the embedding vectors of the nodes, then only query ground truth and run the label spreading model on account pairs within the same sub-graphs. For example, with c clusters of similar sizes, the number of candidate pairs, and hence the computational and memory cost, reduces to O(n²/c). In this way, we significantly reduce the workload of label spreading.

We also use alternative ground truth to reduce the number of ground truth queries, and to further improve scalability. Recall from the Unsupervised Learning using Katz Similarity subsection that Katz similarity can roughly tell how close two accounts are. By setting a high threshold (guaranteed to be higher than the empirical threshold), the pairs above it belong to the same user with high certainty; similarly, a very low threshold tells which pairs are very unlikely to come from the same user.

Figure 2 shows the flow chart for our proposed semi-supervised learning method using graph embedding, with the graph clustering optimization.
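The effect of clustering on the number of candidate pairs can be illustrated with a small counting sketch (the cluster sizes are illustrative):

```python
def num_pairs(group_sizes):
    """Number of account pairs the label spreading model must score:
    only pairs within the same group are candidates."""
    return sum(s * (s - 1) // 2 for s in group_sizes)

n = 1000
all_pairs = num_pairs([n])             # no clustering: n(n-1)/2 pairs
clustered = num_pairs([n // 10] * 10)  # c = 10 equal clusters
```

Here `all_pairs` is 499,500 while `clustered` is 49,500, i.e., roughly a factor-of-c reduction in both the pairs to score and the pairs eligible for ground truth queries.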
Experiments and Evaluation

In this section, we first compare the three methods introduced in the previous section on two small datasets, then apply semi-supervised learning using graph embedding to large datasets, and finally analyze how the input datasets, algorithmic parameters, and optimizations affect the prediction performance.

Because the prediction results are binary, we evaluate them using precision, recall, F1 score, accuracy and AUC (area under curve, which measures the area under the ROC curve; if the prediction is totally random, AUC is 0.5). They are defined as follows:

Precision = tp / (tp + fp)
Recall = tp / (tp + fn)
F1 = 2tp / (2tp + fp + fn)
Accuracy = (tp + tn) / (tp + fp + tn + fn)
AUC = ∫ TPR(T) (−FPR′(T)) dT

Here tp, fp, tn, fn denote the numbers of true positives, false positives, true negatives and false negatives, respectively. TPR and FPR represent the true positive rate and false positive rate, defined as tp/(tp + fn) and fp/(fp + tn), respectively.

Data cleaning is necessary before running our approach. We want to filter out accounts with few activities, and thus little probability of coming from the same user as other accounts. We project the bipartite graph G between accounts and pages onto the account side, using the rule: nodes u and v are connected with edge weight w in the projected graph G_p if and only if accounts u and v have w common neighbors in graph G. After we get the projected graph G_p, we filter out the nodes with degree less than or equal to log|V|.
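A minimal sketch of this projection-and-filtering step, assuming the bipartite graph is stored as account → page adjacency sets (the account and page names are hypothetical):

```python
import math
from itertools import combinations

def project_and_filter(acct_pages, n_nodes):
    """Project the bipartite account-page graph onto the account side.

    Accounts u, v get an edge of weight w iff they share w common pages;
    accounts whose degree in the projected graph is <= log(n_nodes) are
    filtered out (degree counted as number of projected edges, a sketch
    assumption).
    """
    weights = {}
    degree = {a: 0 for a in acct_pages}
    for u, v in combinations(acct_pages, 2):
        w = len(acct_pages[u] & acct_pages[v])
        if w > 0:
            weights[(u, v)] = w
            degree[u] += 1
            degree[v] += 1
    keep = {a for a in acct_pages if degree[a] > math.log(n_nodes)}
    return weights, keep

acct_pages = {"u1": {"p1", "p2"}, "u2": {"p1", "p2"}, "u3": {"p1"}}
weights, keep = project_and_filter(acct_pages, n_nodes=6)
```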
Comparison among the Three Methods

We use two simple datasets, one with 188 accounts and 262 activities, and the other with 4,188 accounts and 6,715 activities, to compare the three methods; the results can be found in Table 2. Each dataset covers a unique display name from the Middle East pages, and neither is part of the large-scale 100-name Middle East datasets.

Dataset  Method        Precision  Recall  AUC
1        u. Katz       0.58       0.81    0.84
2        u. Katz       0.43       0.78    0.86
1        s. Katz       0.57       0.81    0.84
2        s. Katz       0.63       0.65    0.81
1        s. embedding  0.77       0.91    0.93
2        s. embedding  0.74       0.74    0.86

Table 2: Comparison among the three methods. u. stands for unsupervised, and s. stands for semi-supervised.

precision  recall  F1 score  accuracy  AUC
0.830      0.953   0.886     0.996     0.952

Table 3: Prediction performance for 100 display names in the Middle East datasets.

The performance of the unsupervised method using Katz similarity is not bad. We use 75% and 95% as α for datasets 1 and 2, respectively. With more users in dataset 2, the precision is lower than for dataset 1 with fewer users, as expected. The main issue of this method is the requirement of setting the correct α, and we will show how changes in α affect the prediction performance in the Performance Analyses subsection. In our experiments we find α using cross validation. This method still has some advantages, however. It is faster than the semi-supervised learning methods, because only matrix operations are used. It also helps to generate alternative ground truth when the actual ground truth is not sufficient.

The semi-supervised learning method using the Katz similarity matrix performs worse than the one using graph embedding. We suspect the poor performance is caused by the lack of information fed into the label spreading model, since only one scalar value is associated with each account pair.

The semi-supervised learning method using graph embedding brings promising results on both datasets. Unlike the unsupervised method, this semi-supervised method utilizes a small portion of ground truth.
Furthermore, it feeds more information to the label spreading model, with a length-128 embedding vector per account containing both local and global information, and a subsequent length-128 feature vector per account pair. Because of the good performance, we mainly focus on this method in the upcoming experiments.

Evaluation with Large Datasets
We evaluate the semi-supervised learning using graph embedding method with two large-scale datasets, one from the Middle East and the other from the 2016 U.S. election.

Figure 2: Flowchart showing all steps of Algorithm 3 with optimizations.

The results with the 100 datasets (i.e., display names) from the Middle East are summarized in Table 3, and the individual results for each dataset are shown in Figure 3. We did not get the ground truth for all one million accounts, because that would take much longer than practically possible; we used alternative ground truth instead. There are overlaps between the queried and the alternative ground truth, and we found no conflicts between them. The overall performance is quite convincing, particularly with AUC at 0.952. When the number of accounts increases, the precision and the F1 score tend to increase, while the accuracy and recall stay almost within a constant range.

Figure 3: Prediction performance of the 100 datasets from the Middle East, arranged in decreasing size from left to right.

precision  recall  F1 score  accuracy  AUC
0.780      0.756   0.768     0.996     0.877

Table 4: Prediction performance of the U.S. 2016 dataset.

The results with the U.S. 2016 election dataset are listed in Table 4. We would like to detect active users using multiple accounts to comment on the news pages; users with fewer than 1,000 comments are hence ignored. After distributing user activities into multiple split accounts (as described in the Ground Truth subsection), our approach is able to achieve good prediction performance. This experiment is extremely challenging, because it is very difficult to differentiate two users who like to comment on a similar set of pages with only graph information. This is why the achieved AUC, affected by a higher false negative rate, is slightly lower than for the other datasets, but it is still 0.877.

Our results show performance advantages over previous works. Although not running on the same datasets, we can still make some qualitative comparisons. Xiao et al. (2015) used supervised learning for fake account detection on LinkedIn datasets, and achieved 0.949 AUC, with extra information such as IP addresses and account creation times. We achieve 0.952 AUC on the Middle East datasets using only graph properties. Tsikerdekis and Zeadally (2014) used user revision activities in a specific time window, together with nonverbal information of articles and discussions, to detect multiple accounts in Wikipedia.
Their precision, recall and F1 score are 0.729, 0.646 and 0.688, respectively. It can be seen that, although we only use graph features, lacking the other information used by prior works, our performance is comparable or even better.
Performance Analyses
We carry out extensive experiments to further understand how the prediction performance changes with respect to the properties of the input datasets, as well as the chosen parameter values in the algorithms.
Averaged Number of Activities per Account
Figure 4 shows how the averaged activity level affects the prediction performance. In this experiment, we sample a certain level of activity per user from the U.S. 2016 dataset, then randomly assign the sampled activities of each user to 15 split accounts. The averaged number of activities per account is also the averaged degree of the nodes in the constructed graph. It can be seen that, generally, when each account has more activities, i.e., when the constructed graph is denser, the prediction performance is better. This is because more activities lead to more informative node embedding vectors in our algorithm. However, the improvement in AUC is marginal when the averaged degree is larger than 30, and the best precision happens when the averaged degree is 30.

Figure 4: Prediction performance vs. averaged number of activities per account.

Figure 5: Prediction performance vs. averaged number of accounts per user.
Averaged Number of Accounts per User
Figure 5 shows how the averaged number of accounts per user affects the prediction performance. In this experiment, we use a subset of the U.S. 2016 dataset with 100 users and 166,831 activities, and randomly assign the activities of each user to a varying number of split accounts, s, from 10 to 50. AUC is mostly stable when s is between 10 and 25, and starts to drop when s is greater than 25. We also observe that recall drops when s is greater than 25. This is because when the number of accounts per user goes up, the activities of a single user are distributed into too many split accounts, each lacking the characteristics of the user, which leads to inaccuracy in both graph embedding and clustering. For graph embedding, even accounts belonging to different users end up with very similar embedding vectors, introducing a high false positive rate in the label spreading model and thus decreasing the precision. For clustering, putting accounts belonging to the same user into different groups yields high false negative rates, and subsequently low recall.

Katz Similarity Threshold, α

Figure 6 shows how the prediction performance changes with respect to different values of α, using Dataset 2 from the Middle East. When increasing α from 90% to 99.95%, AUC and recall decrease, while precision increases to nearly 1. The empirical threshold for this particular dataset is 95%, which results in the best combination of AUC, precision and recall. It seems that a few percentiles away from the empirical threshold do not significantly reduce the prediction performance. It is clear that if α is set at 99.95%, all account pairs predicted to be from the same user are indeed from the same user, because the precision is 1.

Figure 6: Prediction performance vs. Katz similarity threshold α.

Figure 7: Prediction performance vs. embedding vector length d.
Similarly, if α is set to 80%, all pairs predicted to be from different users are indeed from different users; our experiment shows that, when α is 80%, the precision of predicting 0 is 0.997. As an optimization, we can use predictions with such high confidence as alternative ground truth. Using alternative ground truth can significantly reduce the required number of ground truth queries, and thus the time spent obtaining such ground truth.

Embedding Vector Length, d
Figure 7 shows how the prediction performance changes with respect to the embedding vector length d, using Dataset 1 from the Middle East. Generally, increasing the vector length improves performance, up to a point. When the vector is too short (d less than 32 in this particular experiment), the prediction results are poor, because the amount of information given to the model is insufficient. On the other hand, when d is larger than 128, the embedding vector becomes so long that each element carries little information, and the prediction performance decreases. The best choice of d is around 128, which is the same as the default value used by Grover and Leskovec (2016).

Node2Vec Parameters, p and q
Figure 8 shows how the selection of local vs. global information in the embedding affects the prediction performance, using the U.S. 2016 dataset. The graph embedding method, Node2Vec (Grover and Leskovec 2016), is based on random walks and has a pair of parameters p and q, explained in the Semi-Supervised Learning using Graph Embedding subsection. Generally, more local information, i.e., a lower p/q value (with p < 1 and q > 1), yields better performance, which indicates that local features are more important than global features for multiple accounts detection. However, there is a threshold value of p/q that produces the best AUC and recall. The reason is that, although not as important as local information, global information is also useful; ignoring it entirely adversely affects the prediction performance.

Figure 8: Prediction performance with different p and q values.

Figure 9: Prediction performance vs. the number of clusters.

Number of Clusters, c
Figure 9 shows how the prediction performance changes with respect to the number of clusters used in the clustering optimization, using Dataset 2 from the Middle East. The number of clusters has little influence on the prediction performance, and no significant change happens as the number of clusters increases from 1. This shows that the clustering optimization does not have a high impact on the prediction performance when the number of clusters is small, while it can significantly reduce the computation workload and the memory requirement of the label spreading algorithm, leading to better scalability. However, when there are too many clusters, recall decreases, because accounts from the same users are assigned to different clusters.
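The clustering optimization above can be sketched as follows: partition the embedding vectors first, then run label spreading only within each cluster, so each label spreading instance operates on a much smaller similarity graph. A hypothetical scikit-learn sketch (KMeans and the knn-kernel LabelSpreading stand in for the paper's exact components; `spread_labels_per_cluster` is an illustrative name):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.semi_supervised import LabelSpreading

def spread_labels_per_cluster(X, y, n_clusters=2, n_neighbors=3):
    """Run label spreading independently inside each KMeans cluster.

    X: embedding vectors, shape (n_samples, d).
    y: partial integer labels, with -1 marking unlabeled samples.
    Returns the completed label vector.
    """
    out = y.copy()
    clusters = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)
    for c in range(n_clusters):
        idx = np.where(clusters == c)[0]
        if (y[idx] == -1).all():
            continue  # no labeled seed in this cluster; nothing to spread
        model = LabelSpreading(kernel="knn", n_neighbors=min(n_neighbors, len(idx)))
        model.fit(X[idx], y[idx])
        out[idx] = model.transduction_  # labels inferred within the cluster
    return out
```

Because accounts from the same user should land in the same cluster, this per-cluster decomposition trades little accuracy for a large reduction in the size of each label spreading problem, consistent with the scalability behavior described above.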
With or Without Alternative Ground Truth
We also test how alternative ground truth influences the results, using Dataset 2 from the Middle East. We substitute the sampled ground truth with alternative ground truth from Katz similarity (discussed in the Scaling Up the Proposed Algorithms subsection), and validate the predicted results against the actual ground truth. We use three clusters in this experiment; the precision, recall, and AUC are 0.52, 0.76, and 0.86 respectively, compared to 0.74, 0.74, and 0.86 when using the sampled ground truth. Recall and AUC stay mostly the same, but the precision decreases. When obtaining alternative ground truth for account pairs that should be labelled as coming from the same user, we set α to a very high value, which introduces unbalanced sampling and thus inaccurate cuts, leading to lower precision and slightly higher recall. This shows that using alternative ground truth does not significantly decrease the prediction performance, but can greatly reduce the number of ground truth queries, and thus improves scalability.

Conclusion and Future Work
We introduced an unsupervised method using Katz similarity, and semi-supervised approaches using Katz similarity and graph embedding, for multiple accounts detection in online social networks. Rather than making predictions for individual accounts, our methods make a prediction for each account pair and do not require the exact number of actual users. We also proposed using clustering and alternative ground truth to enhance scalability and lower the sampling rate of the data. Large-scale experiments show that our approach works well for multiple accounts detection in different situations. We also explore how graph features and algorithmic parameters, including random walk strategies, affect the results.

It would be worth incorporating graph convolutional network (GCN) features into our method, and constructing graph features in an end-to-end manner. The basic idea is to use an auto-encoder to encode each node, and a decoder to predict whether two nodes in a graph belong to the same user. It would also be interesting to find out whether utilizing other information from the social network, such as time data, location, and textual information, can improve the multiple accounts prediction performance. Similar methods may also be useful for detecting malicious accounts.
References

Boshmaf, Y.; Ripeanu, M.; Beznosov, K.; and Santos-Neto, E. 2015. Thwarting fake OSN accounts by predicting their victims. In Proceedings of the 8th ACM Workshop on Artificial Intelligence and Security, AISec 2015, Denver, Colorado, USA, October 16, 2015, 81–89.

Cao, S.; Lu, W.; and Xu, Q. 2015. GraRep: Learning graph representations with global structural information. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, 891–900. ACM.

Cao, S.; Lu, W.; and Xu, Q. 2016. Deep neural networks for learning graph representations. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, February 12-17, 2016, Phoenix, Arizona, USA, 1145–1152.

Constine, J. 2017. Number of monthly active Facebook users worldwide as of 3rd quarter 2017 (in millions). https://techcrunch.com/2017/06/27/facebook-2-billion-users/