Bot-Match: Social Bot Detection with Recursive Nearest Neighbors Search
DAVID M. BESKOW and KATHLEEN M. CARLEY,
Carnegie Mellon University

Social bots have emerged over the last decade, initially creating a nuisance while more recently used to intimidate journalists, sway electoral events, and aggravate existing social fissures. This social threat has spawned a bot detection algorithms race in which detection algorithms evolve in an attempt to keep up with increasingly sophisticated bot accounts. This cat and mouse cycle has illuminated the limitations of supervised machine learning algorithms, where researchers attempt to use yesterday’s data to predict tomorrow’s bots. This gap means that researchers, journalists, and analysts daily identify malicious bot accounts that are undetected by state of the art supervised bot detection algorithms. These analysts often desire to find similar bot accounts without labeling/training a new model, where similarity can be defined by content, network position, or both. A similarity based algorithm could complement existing supervised and unsupervised methods and fill this gap. To this end, we present the Bot-Match methodology in which we evaluate social media embeddings that enable a semi-supervised recursive nearest neighbors search to map an emerging social cybersecurity threat given one or more seed accounts.

Additional Key Words and Phrases: graph embedding, social bot detection, content based information retrieval
Today, sophisticated state and non-state actors are using information systems in general and social media in particular to change the beliefs and actions of target societies and cultures. These (dis)information campaigns, if left unchecked, gradually degrade the target society by eroding key institutions and values while widening existing fissures. This information “blitzkrieg” has led to the emerging discipline of social cybersecurity in which societies attempt to protect their culture and values from external manipulation while maintaining a free market for opinions and ideas. One of the key functions that computer science brings to the multi-disciplinary table of social cybersecurity is bot/cyborg/sybil/troll detection and characterization.

Supervised and unsupervised machine learning models both provide important contributions to bot detection, but are not sufficient for social cybersecurity practitioners. Supervised models trained on prevalent bot data provide an initial “triage” of social media streams, identifying likely areas of bot involvement and artificial manipulation of the online conversation. However, building supervised machine learning algorithms for every bot-detection scenario quickly grows untenable. Myriad bot genres have evolved, including spam bots, intimidation bots, propaganda bots, social influence bots, cyborg accounts, and many others. Each of these bot genres has unique features and is curated and deployed in various ways depending on the target audience and culture. It is neither possible to train a single model that generalizes to every genre, nor is it convenient to label and train models for every genre and then update these models on a frequent basis to keep up with bot evolution. Unsupervised learning, on the other hand, provides a way to find certain types of bots, such as the correlated bots found by the Debot algorithm [14].
These types of models are especially helpful in identifying labeled data for supervised models [8], but once again they are not sufficient. Often the most sophisticated and influential disinformation bots/cyborgs can fly “under the radar,” undetected by either supervised or unsupervised models. When we began to triage external manipulation of the Canadian political conversation in the runup to the 2019 Canadian national elections, we found multiple influential accounts had emerged that were not being detected
Authors’ address: David M. Beskow, [email protected]; Kathleen M. Carley, [email protected], Carnegie Mellon University, 5000 Forbes Ave, Pittsburgh, PA, 15213.

by production bot detection algorithms, but nonetheless were 1) divisive, 2) appeared to have foreign connections, and 3) appeared to have automated activity (i.e., were bots).

We find that social cybersecurity practitioners in journalism, industry, academia and government, when faced with these sophisticated accounts, naturally ask the simple question “I wonder how many other accounts are similar to this one?” We developed
Bot-Match to fill this gap, allowing analysts to rapidly find similar accounts in a flexible manner where the analyst can determine how they want to define similarity. This approach is complementary to supervised/unsupervised models, and is not designed to replace them.
Fig. 1. Framework to develop social media embeddings that enable a semi-supervised recursive nearest neighbors search to find similar accounts given one or more seed accounts.
Similarity could be defined as similar network connections (either similar neighbors or a similar network role), similar content, or a combination of both. With Bot-Match the analyst can choose to embed the conversation network, the conversation content, or both simultaneously, and then find similar accounts given a query. In this case the query is the seed node(s), and the algorithm returns the nearest neighbors given the predefined similarity measure. By recursively making this query, the analyst can rapidly map out a sophisticated information campaign that is undetected by other social cybersecurity tools. This approach is illustrated in Figure 1.

Bot-Match is a type of information retrieval where the query is a malicious actor and all of their features (semantic and network features). Within information retrieval this type of query is often called Content Based Information Retrieval (CBIR). Google Reverse Image Search is an example of CBIR. The power of Google reverse image search is that the user can upload a feature rich image rather than a general and limited semantic query. For example, if an intelligence analyst wanted to find emerging images of terrorists, simply typing “terrorist” into the Google image search would reveal many stock photos of a stereotypical terrorist, which is not helpful.
However, if the same intelligence analyst conducted a CBIR-based reverse image search with a photo of a terrorist fighter from a specific terrorist organization, complete with distinguishing flags, background, and unique face masks, this feature rich query would provide a meaningful and relevant query result.

Bot-Match, similar to Google reverse image lookup, seeks to use the feature rich information about an account to serve as the query for information retrieval, returning a ranked list of similar accounts. This search becomes a semi-supervised search as the user identifies additional accounts of interest, and recursively searches with these new accounts. While recursive, in practical social cybersecurity workflows the user will seldom recurse more than a few ‘hops’, and will not fully explore the entire stream.

Bot-Match is designed to complement, not replace or be compared to, supervised models. Supervised models conduct a classification task, often on large streams of data. Bot-Match conducts content based information retrieval which returns a ranked list, often in a recursive manner, to explore accounts of interest and find similar accounts. Although semi-supervised, Bot-Match also differs in purpose from other semi-supervised label/belief propagation models. Label/belief propagation attempts to classify a large stream with only a few labeled instances. Some label/belief propagation methods produce a ranked list and have seen use in information retrieval. We will analyze these below.

The primary contribution of this paper is describing content based information retrieval with a feature rich account object for social cybersecurity. We merge ideas and models from information retrieval, graph embedding, semantic embedding, semi-supervised learning, cluster analysis, and link prediction to create a novel solution to a common hard question.
We introduce two new social cybersecurity data sets and devise a test and evaluation for candidate models. This paper also validates this approach on US and Canadian social media data related to election events. While we apply this in the specific context of malicious disinformation operations, other intelligence and commercial data workflows could leverage content based information retrieval with rich account features.

This paper is organized as follows. It begins by describing past research in content based information retrieval, particularly focused on semantic and graph embedding. We also briefly discuss relevant research from semi-supervised learning, information retrieval, and link prediction. We conduct a formal evaluation of these models on two benchmark datasets as well as two labeled data sets associated with social cybersecurity, and use this evaluation to select models for Bot-Match. We then conduct a visual validation of the selected models using data associated with the 2018 US Midterm elections. Finally, we describe where Bot-Match fits in the social cybersecurity workflow and illustrate the use of the Bot-Match methodology in detecting disinformation actors in the 2019 Canadian national elections.
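The recursive nearest neighbors search at the core of this methodology can be sketched as follows. This is a minimal illustration rather than our production implementation: the embedding matrix, account names, and seed account are fabricated toy values, and any embedding (content, network, or combined) could be substituted.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def bot_match(embedding, accounts, seeds, k=5, hops=2):
    """Recursively query the k nearest neighbors of seed accounts.

    embedding: (N x d) array, one row per account (content, network, or both)
    accounts:  list of N account identifiers aligned with the rows
    seeds:     initial account(s) of interest
    Returns the set of accounts reached within the given number of hops.
    """
    index = {a: i for i, a in enumerate(accounts)}
    nn = NearestNeighbors(n_neighbors=k + 1, metric="cosine").fit(embedding)
    found, frontier = set(seeds), set(seeds)
    for _ in range(hops):
        if not frontier:
            break
        rows = embedding[[index[a] for a in frontier]]
        _, nbrs = nn.kneighbors(rows)
        # Drop self-matches and accounts already visited
        frontier = {accounts[j] for row in nbrs for j in row} - found
        found |= frontier
    return found

# Fabricated toy embedding: seed/botA/botB cluster apart from userC/userD
emb = np.array([[1.0, 0.0], [0.9, 0.1], [1.0, 0.2], [0.0, 1.0], [0.1, 1.0]])
names = ["seed", "botA", "botB", "userC", "userD"]
print(sorted(bot_match(emb, names, ["seed"], k=2, hops=1)))
# → ['botA', 'botB', 'seed']
```

In practice the analyst inspects each ranked list before recursing, which is what makes the search semi-supervised rather than automatic label propagation.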
This paper merges concepts from information retrieval, network embedding, document embedding, semi-supervised learning, link prediction, social recommendation, nearest neighbors classification, and social media analytics. While we cannot go into depth in each of these deep fields of study, we highlight relevant and related research below.
Within information retrieval, the idea of object reverse search is called Content Based Information Retrieval (CBIR), and has primarily been applied to images and other multi-media [32]. Allegedly, the idea of CBIR was born in a 1992 National Science Foundation Workshop [49]. Information professionals often use CBIR for 1) discovering digital content on the web and 2) understanding how
images are reused [49]. Few CBIR examples exist outside of multi-media search and analysis. These include spatial and textual search [35], which searches based on spatial and textual similarity.

Query on social networks is called Social Information Retrieval (SIR) [11, 20]. Research in SIR has focused on social web search (improving classic information retrieval with social data), social search (sourcing questions to a social network), and social recommendation (using social networks to improve item recommendation) [11]. These have led to the emergence of social content search systems like
TwitterSearch, Social Bing, and others. Social search is largely focused on how to leverage the rich tags that are used on many social platforms. Social recommendation research focuses on classic collaborative filtering tasks. For example, SocialGCN [56] was introduced in 2019 as a way to improve classic collaborative filtering with Graph Convolutional Networks to improve recommending items to users. Similarly, Zhang et al. [61] use matrix factorization with social regularization and the user scoring matrix to improve social recommendation.

Related to our research, Choumane [15] proposed social information retrieval based on semantic similarity. Weng et al. [55] demonstrated a privacy-preserving search for content using rich data that had been hashed; privacy of both the query and the data store are preserved in this way. Neither of these specifically focused on embedding accounts for social cybersecurity.
An embedding is a structure preserving map of one mathematical structure into another. The mathematical structure X mapped into Y is written f: X ↪ Y. In our case we intend to map semantic structure and graph structure into Euclidean space, both separately and then simultaneously. The embedding of semantic space has been an active research area since the 1950s [26], and graph embedding dates back to at least the 1960s [51] with combinatorial approaches. In this section we will highlight past work and motivate our selection of evaluated models for semantic and graph embedding. All of these models are transductive: the learned embedding allows us to find new data that is already represented within the network, but does not allow us to label new data from a different graph.

Given the success of word embeddings [38], researchers have attempted to develop embeddings for phrases, sentences, and even documents. Early approaches simply averaged word embeddings for sentences and documents [42]; these were expanded in sent2vec [41] using word and ngram embeddings while simultaneously training the composition and the embedding. Doc2vec [30] extended the concept of word2vec to paragraphs and other variable length text. To do this, doc2vec introduced a fixed length paragraph vector to represent variable length text that is trained by predicting words in the paragraph. Other approaches use a Recurrent Neural Network (RNN) approach as demonstrated by the Skip-Thought model [29]. These were later trained over Natural Language Inference (NLI) to achieve improved results [50].

These approaches are often developed for a single language at a time, and are therefore limited on social media where most large conversations have multiple languages represented. Multi-lingual embeddings have been accomplished by learning jointly on parallel corpora [43] or by training independently and then mapping to a shared space with a bilingual dictionary [3].
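The early “average the word vectors” approach mentioned above can be sketched in a few lines. The three-dimensional word vectors here are fabricated stand-ins for a real pretrained model such as word2vec or GloVe, which would supply 100-300 dimensional vectors learned from a large corpus.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors (fabricated for illustration)
word_vecs = {
    "bots": np.array([0.9, 0.1, 0.0]),
    "spread": np.array([0.2, 0.8, 0.1]),
    "disinformation": np.array([0.7, 0.3, 0.2]),
}

def doc_embedding(tokens, vecs):
    """Embed a document as the mean of its known word vectors."""
    known = [vecs[t] for t in tokens if t in vecs]
    if not known:
        return np.zeros(len(next(iter(vecs.values()))))
    return np.mean(known, axis=0)

doc = "bots spread disinformation".split()
print(doc_embedding(doc, word_vecs))  # → [0.6 0.4 0.1]
```

Later composition methods (sent2vec, doc2vec) replace this simple mean with learned composition, but the input/output shape of the operation is the same: variable-length text in, fixed-length vector out.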
Two competing models for universal encoding are Google’s Universal Sentence Encoder [13] and Facebook’s Language-Agnostic SEntence Representations (LASER) toolkit [4]. The Google Universal Sentence Encoder (USE) encodes sentences and short paragraphs using two models, the Transformer model and the Deep Averaging Network (DAN) model. The Transformer model uses the encoding subgraph of the transformer architecture to create context aware embeddings. The DAN model uses a feedforward deep neural network to average word and bigram representations. Facebook’s LASER toolkit uses an encoder/decoder approach with Bidirectional Long Short-Term Memory (BiLSTM) trained on 223 million sentences to create a universal encoding scheme for 93 languages. In our implementation, Google USE was applied to cleaned and concatenated user content, while Facebook LASER was applied at the individual tweet level and the tweet level embeddings were then averaged to create a user/node embedding.

Prior to word and sentence embedding, researchers attempted to analyze topic groups with several methods, notably Latent Dirichlet Allocation [10] and Latent Semantic Analysis/Indexing [17]. Both Latent Dirichlet Allocation (LDA) and Latent Semantic Analysis (LSA) are used to discover latent topics found in a corpus of documents and to reduce dimensionality. In the course of assigning documents to a fixed number of topics, both models also reduce the dimensions of the corpus and inherently create a document embedding. These approaches operate on a bag of words or tokens and are inherently multi-lingual (assuming appropriate language parsers). Latent Dirichlet Allocation (LDA) uses a probabilistic statistical model to map documents to topics while Latent Semantic Analysis/Indexing (LSA/LSI) uses singular value decomposition to reduce the dimensions and thereby produce a set of topics or concepts.
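As a sketch of how LSA and LDA both yield low-dimensional document embeddings (using scikit-learn, two topics instead of the 200 used in our experiments, and a fabricated four-document corpus):

```python
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import TruncatedSVD, LatentDirichletAllocation

# Fabricated toy corpus: two political documents, two disaster documents
corpus = [
    "election fraud claims spread online",
    "voter fraud claims dominate the feed",
    "storm flooding damages the coastal city",
    "hurricane flooding hits the city",
]

# LSA: singular value decomposition of the TF-IDF matrix
tfidf = TfidfVectorizer().fit_transform(corpus)
lsa_embedding = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)

# LDA: probabilistic topic model over raw term frequencies
counts = CountVectorizer().fit_transform(corpus)
lda_embedding = LatentDirichletAllocation(
    n_components=2, random_state=0).fit_transform(counts)

print(lsa_embedding.shape, lda_embedding.shape)  # (4, 2) (4, 2)
```

Each row of either output is a fixed-length vector for one document (here one user’s concatenated text), which is exactly what the nearest neighbors search consumes; the LDA rows are topic distributions and sum to one.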
We used term frequency-inverse document frequency (TF-IDF) for LSA, but raw term frequency for LDA, since Blei describes how LDA was created to overcome some of the shortcomings of TF-IDF.

From the discipline of collaborative filtering we find another approach for measuring similarity and delineating neighbors without creating an embedding. Starbird, Muzny and Palen used collaborative filtering to find social media users on the ground during natural disasters [47]. Memory based collaborative filtering employs similarity measures to identify neighbors and thereby make recommendations for user-item data [44]. Using a similar approach, we leverage cosine and Jaccard similarity to measure similarity between users’ content based on a bag of words representation. Given a bag of words representation, cosine similarity measures the cosine of the angle between term frequency-inverse document frequency (TF-IDF) vector representations of the documents. Jaccard similarity, on the other hand, compares the relative intersection of two documents. Cosine similarity performs better on TF-IDF, whereas Jaccard similarity performs better on a Bag-of-Words representation: cosine similarity takes frequencies into consideration, while Jaccard similarity only considers the presence or absence of words. This offers a baseline for comparison of more complex methods, and as we discover, performs surprisingly well when accounts have produced substantial content.
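The two baseline measures can be sketched as follows (toy documents for illustration; scikit-learn is assumed for the TF-IDF step):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

doc_a = "bots amplify political spam daily"
doc_b = "bots amplify spam"

# Cosine similarity on TF-IDF vectors: sensitive to term weights
vecs = TfidfVectorizer().fit_transform([doc_a, doc_b])
cos = cosine_similarity(vecs)[0, 1]

# Jaccard similarity on word sets: presence/absence only
set_a, set_b = set(doc_a.split()), set(doc_b.split())
jac = len(set_a & set_b) / len(set_a | set_b)

print(round(cos, 3), round(jac, 3))  # Jaccard here is 3 shared / 5 total = 0.6
```

Because neither measure requires training, both serve as cheap baselines: a query account’s bag of words is compared against every other account and the top matches are returned.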
While most graph based analysis is designed to operate on the original adjacency matrix or an equivalent structure, recently methods have been devised to embed the graph in vector space. Vector space representations of graphs have applications in node classification, link prediction, clustering, and visualization [23]. In this paper graph embedding is specifically focused on embedding nodes into vector space, not embedding the entire graph in vector space.

For the purposes of this research we have adopted the taxonomy that Goyal and Ferrara introduced [23]. They divide graph embedding techniques into methods based on 1) Factorization, 2) Random Walk, and 3) Deep Learning. In our research we test prominent models from each of these categories.

Factorization methods use various techniques to factorize the adjacency matrix or another matrix representing the graph (Laplacian matrix, Katz similarity matrix, others). Eigenvalue decomposition can be used on matrices that are positive semi-definite; otherwise gradient descent methods are used. The primary factorization models we tested were the High-Order Proximity preserved Embedding (HOPE) algorithm [40] and Facebook BigGraph [31], with Singular Value Decomposition of the Adjacency Matrix used as a baseline. HOPE preserves higher order proximity by minimizing
Table 1. Model Description by Type
Type             Subtype                  Model                   Data                     Embed Dim
Content          Collaborative Filtering  Jaccard Similarity      Term Frequency           No embedding
                                          Cosine Similarity       Term Frequency           No embedding
                 Topic Modeling           LDA                     Term Frequency           N x 200
                                          LSI                     TFIDF                    N x 200
                 Document                 Doc2Vec                 User Text                N x 200
                 Universal                Google USE              User Text                N x 512
                                          Facebook LASER          Tweet Text               N x 1024
Network          Factorization            Graph Factorization     Adjacency Matrix         N x 32
                                          HOPE                    Adjacency Matrix         N x 128
                                          BigGraph                Edge list                N x 1024
                 Random Walk              node2vec                Edge list                N x 64
                                          Splitter                Edge list                N x 128
                                          role2vec                Edge list                N x 128
                                          SybilRank               Adjacency List           No embedding
                 Deep Learning            SDNE                    Adjacency Matrix         N x 128
                                          GCN (no features)       Adjacency Matrix         N x 32
Network&Content  Deep Learning            GCN with Features       Adjacency Matrix         N x 32
                                          GraphSAGE               Adjacency Matrix & BoW   N x 50
                 Factorization            BigGraph with initial   Edge list w/ Embedding   N x 1024

‖S − Y_s Y_t^⊤‖²_F, where S is a similarity matrix used in place of the adjacency matrix. In our case we used the Katz index to create the similarity matrix. Katz centrality is defined as

    C_Katz(i) = Σ_{k=1}^{∞} Σ_{j=1}^{n} α^k (A^k)_{ji}

where A is the adjacency matrix and α is the attenuation factor (which must be smaller than the reciprocal of the largest eigenvalue of A for the sum to converge).

In addition to using the HOPE algorithm, we also tested Facebook’s PyTorch-BigGraph toolbox [31]. PyTorch-BigGraph can embed large graphs using several available factorization based models (TransE, RESCAL, DistMult and ComplEx). PyTorch-BigGraph overcomes complexity and memory constraints by partitioning the graph and then using multi-threaded and distributed execution with batched negative sampling. In testing PyTorch-BigGraph we wanted to determine what our loss of performance would be for scalability. We tested three settings for the BigGraph algorithm. The first setting uses a single edge type, as is assumed in all other models. BigGraph also allows the user to define different edge types, which was tested and reported using our three edge types (mention, retweet, and reply).
This multi-modal approach did not perform well, and is not included in our results. The third and final setting that we test and report is BigGraph with an initial value, in our case a content embedding produced by LDA. This initializes the training with knowledge learned from content similarity, and as reported below often improved BigGraph performance (though in this setting BigGraph requires an initialization computation which may be computationally costly and which makes BigGraph no longer an end-to-end solution).

Multiple models leverage random walks to embed a graph. The model that we used is the node2vec model [24], which uses a biased random walk procedure to explore neighborhoods while preserving higher order proximity between nodes. We also tested the Splitter [19] random walk based algorithm that is tailored for social networks; it embeds multiple persona based representations of each user and then combines these to produce a single embedding for the user.

Also within the random walk family of models we implement role2vec [1]. The role2vec algorithm leverages attributed walks which map an input vector to a vertex type. The embedding structure for role2vec differs from all of our other methods in that it embeds the vertex type and not necessarily the vertex neighborhood. Our implementation uses the Weisfeiler-Lehman kernel [46] to extract vertex features. SybilRank [12] is based on random walks but does not produce an embedding; it is discussed in more detail below.

Our primary Deep Learning model was the Structural Deep Network Embedding (SDNE) model [54] that uses deep autoencoders and decoders to preserve 1st and 2nd order network proximities. This is accomplished by optimizing both proximities simultaneously. The decoder is based on Laplacian Eigenmaps, which penalizes similar vertices that are embedded far apart.
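As context for the Katz similarity matrix used by HOPE, the infinite Katz sum has a closed form, S = (I − αA)^{-1} − I, which the following NumPy sketch verifies against the truncated series on a fabricated 3-node cycle graph:

```python
import numpy as np

# Toy directed graph: 0→1, 1→2, 2→0 (a 3-cycle); spectral radius is 1
A = np.array([[0, 1, 0],
              [0, 0, 1],
              [1, 0, 0]], dtype=float)

alpha = 0.5  # attenuation factor; must be below 1 / spectral_radius(A)
I = np.eye(3)

# Closed form of the infinite Katz sum: sum_{k>=1} alpha^k A^k
S_closed = np.linalg.inv(I - alpha * A) - I

# Truncated series for comparison (terms beyond k=50 are negligible)
S_series = sum(alpha**k * np.linalg.matrix_power(A, k) for k in range(1, 50))

print(np.allclose(S_closed, S_series))  # True
```

HOPE then factorizes this S (rather than A itself) into source and target embeddings, which is what lets it capture multi-hop proximity that a plain adjacency factorization would miss.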
By its very nature social media data contains rich node features, including rich semantic features. Graph convolutional networks [28] have emerged as a way to embed a network simultaneously with the respective node features. While a variety of node feature representations exist in social media, we primarily used GCNs to simultaneously encode the network while considering node content features (the combined tweet text produced by each account). GCN scales better than SDNE by iteratively applying a convolution operator on the graph and aggregating the embeddings of neighbors. The GCN defines a function of the form

    f(X, A) = softmax(Â ReLU(Â X W^(0)) W^(1))

where X represents the node features, A represents the network adjacency matrix, and Â = D̃^(−1/2)(A + I_N)D̃^(−1/2) is the symmetrically normalized adjacency matrix with self-loops (D̃ being its degree matrix).

In addition to Kipf’s original GCN autoencoder, we also evaluate GraphSAGE (SAmple and aggreGatE) [25], which extends GCNs to generate inductive embeddings by sampling and aggregating each node’s neighborhood features. GraphSAGE is designed to embed larger networks than the original GCN, as we demonstrate below.

Several past research efforts are somewhat related to our use of similarity measures to detect bots. One well-known and often cited unsupervised machine learning tool is the Debot algorithm [14], which uses warped correlation to find correlated accounts. These correlated accounts are bot accounts that post the same content at roughly the same time. Also notable is a model by Xiao et al. [57] that uses various features to classify entire clusters of new accounts on LinkedIn to detect batches of fake accounts. Magelinski et al. [36] demonstrate bot detection with graph classification by extracting a graph’s latent local features and binning nodes together along 1-D cross sections of the feature space. Finally, Ali Alhosseini et al.
[2] use GCNs with network, content, profile, and propagation features to train a supervised machine learning model to detect fake news given URL- and cascade-wise detection. This study also explains and measures the decrease in bot detection performance as a model ages. Note that this research explicitly uses the GCN and geometric deep learning in a supervised manner, and while it visualizes lower dimensional representations of bot detection features, it does not use these models in an unsupervised or semi-supervised manner as proposed by Bot-Match.
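The two-layer GCN propagation rule described earlier can be sketched in NumPy; the random weight matrices here stand in for trained parameters, and the 3-node path graph is fabricated for illustration:

```python
import numpy as np

def gcn_forward(A, X, W0, W1):
    """Two-layer GCN: f(X, A) = softmax(Â ReLU(Â X W0) W1),
    with Â the symmetrically normalized adjacency with self-loops."""
    A_hat = A + np.eye(A.shape[0])               # add self-loops (A + I_N)
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    A_norm = d_inv_sqrt[:, None] * A_hat * d_inv_sqrt[None, :]  # D̃^-1/2 Ã D̃^-1/2
    h = np.maximum(A_norm @ X @ W0, 0)           # ReLU hidden layer
    logits = A_norm @ h @ W1
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)      # row-wise softmax

rng = np.random.default_rng(0)
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)  # path graph
X = rng.normal(size=(3, 4))                      # node (content) features
out = gcn_forward(A, X, rng.normal(size=(4, 8)), rng.normal(size=(8, 2)))
print(out.shape)  # (3, 2); each row is a probability distribution
```

For Bot-Match the quantity of interest is not the softmax output but the hidden representation, which serves as a node embedding that mixes network structure with content features.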
Research has used semi-supervised methods to detect social bots (sometimes called sybils) since the earliest days of social networks. These methods include representative (linearized) label propagation methods [21, 53], random walk methods [27, 60], and recently adapted Graph Convolutional Network (GCN) methods for sybil detection with a few labeled instances [33]. There are several key differences between our approach and that of these semi-supervised models. First, many of these assume that the network contains a sybil (bot) region and a benign region, with attack edges between them. We have found that this is not always the case; many times the most sophisticated bots are tightly embedded in the target audience (assumed to be benign). Second, while many of these methods use very few labels, most of the papers evaluate their methods with a sample of 100 [27] to several thousand instances [53]. In our case, we often have only one instance to query with. Finally, we want to return a ranked list, not a classification. Label propagation can at times return a ranked list and has seen limited use for information retrieval [59]. For example, SybilRank [12] can return a ranked list and has been used as a semi-supervised model to rank possible sybil accounts. SybilRank does assume a sybil region and a benign region of the network, with attack edges between them (in other words, there is tight clustering among sybils and among human users). Nonetheless, we include SybilRank for comparison purposes in our evaluation. SybilRank (and other label propagation methods) do not produce an embedding, and can therefore have faster run times if a user only performs a handful of queries. Additionally, unlike some of our methods, SybilRank easily resolves queries with more than one account.
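A simplified sketch of SybilRank-style trust propagation (degree-normalized power iteration from trusted seeds, terminated early after O(log n) steps) on a fabricated five-node graph:

```python
import numpy as np

def sybilrank(adj, seeds, n_iter=None):
    """adj: list of neighbor lists (undirected); seeds: trusted account indices.
    Returns degree-normalized trust scores (higher = more likely benign)."""
    n = len(adj)
    deg = np.array([len(nb) for nb in adj], dtype=float)
    trust = np.zeros(n)
    trust[list(seeds)] = 1.0 / len(seeds)
    steps = n_iter or max(1, int(np.ceil(np.log2(n))))  # early termination
    for _ in range(steps):
        new = np.zeros(n)
        for v, nbrs in enumerate(adj):
            for u in nbrs:
                new[u] += trust[v] / deg[v]   # spread trust evenly to neighbors
        trust = new
    return trust / deg  # final ranking uses degree-normalized trust

# Toy graph: benign triangle {0,1,2}, one attack edge 2-3, sybils {3,4}
adj = [[1, 2], [0, 2], [0, 1, 3], [2, 4], [3]]
scores = sybilrank(adj, seeds=[0])
print(np.round(scores, 3))  # node 3, behind the attack edge, gets the least trust
```

This sketch shows why the attack-edge assumption matters: trust escapes the benign region slowly only because few edges cross into the sybil region, an assumption that sophisticated bots embedded in the target audience violate.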
In our task of measuring similarity between users in an online social network, we borrow some concepts from social link prediction. For example, [5] proposes a social user recommendation method that recommends based on either a ‘topic’ or a ‘social’ link. This is similar to our desire to find similarity semantically and socially. Our method must diverge from this in that existing links must be considered.
Bot-Match assumes that the user has a filtered stream or conversation that she is trying to analyze for the presence of disinformation. This could be an online discussion around a topic (i.e., climate change), a political event (the 2019 Canadian elections), or a natural disaster (Hurricane Harvey). Note that in all cases the individual tweets are part of a larger connected online conversation and are not randomly selected from the social media environment (network analysis requires a network).

Bot-Match also assumes some level of network homophily as well as a static network snapshot. We assume network links are far more likely to happen between accounts that are similar or share a similar narrative than between accounts that are different. While some communication links may confront an account that is different, we assume the majority of links are with similar accounts. In social networks this is often referred to as homophily and defined as “birds of a feather flock together” [37]. Additionally, we assume that the analyst is focused on a snapshot of a network (following links at time x or communication links during the month of January). While some of the methods discussed below could adapt to dynamic networks, many of them could not.

Below we list two data sets that we collected in order to evaluate Bot-Match. Bot-Match is designed to find similar bot accounts that have evaded a supervised machine learning initial approach. In order to test Bot-Match, we needed data sets where we can separately identify similar accounts, label them, and then test Bot-Match’s ability to find them given a single seed node. The two data sets selected are discussed below:
Fig. 2. Conversational Network of Followers of a Journalist in Yemen. Red denotes random string accounts that were part of an intimidation campaign. Network includes 35,763 nodes and 195,172 edges.
The first data set is the combined tweets produced by all followers of a freelance journalist in Yemen. Starting in the Fall of 2017, a determined and documented intimidation attack was launched against her Twitter account [34]. The intimidation attack was characterized by a surge of strange accounts, many of them with strange and disturbing images or threatening messages. Many of these intimidation accounts were distinguished by 15 digit randomly generated alpha-numeric strings for their screen name, such as gG6RKc6QBqOLKyU (not a real screen name). We developed a logistic regression classifier to detect random strings based on features consisting of character n-grams and string entropy. Using this model we were able to achieve 94.25% accuracy in identifying random string accounts in Twitter, allowing us to automatically label 4,312 accounts characterized by a random string screen name and likely part of the coordinated intimidation campaign. While these accounts do not compose the entire intimidation attack, they nonetheless present an interesting set of social cybersecurity accounts that we can externally label, testing Bot-Match’s performance in finding them given a query node. In our experiment, we test whether nearest neighbor searches of various embeddings would be able to find these random string intimidation accounts if we were not able to label them by their screen name. Throughout the rest of the paper this data will be called the
Yemen data. The conversational network of the Yemen data is provided in Figure 2 with random string intimidation accounts colored red. A data description is provided in Table 2.
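The random string screen name classifier can be sketched as character n-gram counts plus a string entropy feature feeding a logistic regression. The handful of training names below is fabricated and far smaller than the labeled set behind the reported 94.25% accuracy; this is a shape-of-the-pipeline sketch, not the production model.

```python
import math
from collections import Counter

from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

def entropy(s):
    """Shannon entropy of the character distribution of a string."""
    counts = Counter(s)
    return -sum((c / len(s)) * math.log2(c / len(s)) for c in counts.values())

# Fabricated examples: random-string handles (1) vs. human-looking handles (0)
names = ["gG6RKc6QBqOLKyU", "xT9hQw2ZpLm4RsV", "k2NvB8qWyE0dJxA",
         "news_fan_sara", "john_smith22", "maple_leafs_guy"]
labels = [1, 1, 1, 0, 0, 0]

ngrams = CountVectorizer(analyzer="char", ngram_range=(2, 3)).fit(names)
X = hstack([ngrams.transform(names),
            csr_matrix([[entropy(n)] for n in names])])  # append entropy column
clf = LogisticRegression(max_iter=1000).fit(X, labels)

test_names = ["pQ7ZxT3mKvR9bLs", "hockey_mom_anne"]
X_test = hstack([ngrams.transform(test_names),
                 csr_matrix([[entropy(n)] for n in test_names])])
print(clf.predict(X_test))
```

Random alpha-numeric handles tend to have near-uniform character distributions (high entropy) and few repeated n-grams, which is the signal the two feature families capture.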
The second data set consists of tweets produced by Russia’s Internet Research Agency around the time of the 2016 US elections and released by Twitter in October 2018 [52]. The St. Petersburg based Internet Research Agency (IRA) is a company that conducts focused online information operations on behalf of the Russian government and Russian businesses. The IRA represents one of the more experienced organizations involved in state-sponsored disinformation [7]. Twitter detected deliberate manipulation by the IRA, suspended the accounts, and released the related data in an elections transparency effort (similar manipulation has been associated with Iran, Venezuela, China, and Spain, and related data has been released).
Fig. 3. Conversational Network of Russian Internet Research Agency data released by Twitter. Red denotes accounts that targeted African American communities. Network includes 1,958 nodes and 35,931 edges.
The data demonstrates that the IRA specifically targeted African American online users in an effort to increase racial tensions in the United States [45]. For the purposes of testing Bot-Match we label any account that shared relevant hashtags targeting African American populations as an account that is participating in this effort. In our case study we test the performance of Bot-Match in detecting these accounts after removing all hashtags. The conversational network of the IRA data (removing all nodes that did not produce tweets in the released dataset) is provided in Figure 3. A data description is provided in Table 2.
Table 2. Data Summary
Data             Yemen Data    IRA Data     CORA Data   Reddit Data
Users/Nodes      35,763        1,958        2,708       231,443
Total Edges      195,172       35,931       5,278       11,606,919
Tweets           4,535,117     9,041,308
Top Languages    en,ar,fr,es   ru,en,de,uk
Retweet Edges    108,382       31,398
Mention Edges    50,933        859
Reply Edges      35,857        4,122
In addition to the two data sets described above, we also included two benchmark data sets. For evaluation we require network data with semantic node features and node labels. We chose the Cora citation dataset, which contains a bag of words (BoW) for semantic node features. Cora's network topology and properties are substantially different from online social networks, but it was selected due to its frequent use in network-related research. We also chose a larger social network benchmark dataset derived from Reddit, which uses semantic embedding features for node features. This dataset was created and used by Hamilton et al. to evaluate GraphSAGE [25].
Given curated, filtered, and related social media data, Bot-Match first builds the communication network edgelist by assigning a source and target for each directional communication. In the Twitter environment, this means creating directed links between accounts that mention, retweet, or reply to each other. Once the network is created, we remove any nodes that are not in the dataset (meaning we do not have content and node features for them). For example, a user may be mentioned in the data set but never produced a tweet that ended up in the dataset. These users are removed, as are any isolates that remain. If keeping these users were required, collecting their content/timelines would be an additional data requirement.

For most of our models, retweet, reply, and mention edges are treated equally as directed communication links. Only the Facebook BigGraph algorithm takes the categories of links into consideration, with mixed results. In practice the analyst could choose to embed a single modality of interest, for example only the retweet network.

Unless otherwise noted (see comments on Facebook LASER), the text associated with each user (node) is a concatenation of all social media posts associated with that user. To clean the text we removed URLs, punctuation, reserved words, emojis, and smileys. We removed hashtags and mentions from the IRA data but left them in the Yemen data. They were removed from the IRA data because they were used to label the data and would artificially inflate content embedding performance.

To evaluate embedding models for the Bot-Match methodology with the Yemen and IRA data, we created the respective content, network, and network + content embeddings for each data set, and then used these embeddings to search for nearest neighbors of positively labeled accounts.
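The network-construction and filtering steps above can be sketched as follows; this is a minimal sketch, not the production implementation, and it assumes tweets in the classic Twitter v1.1 JSON layout (`user.screen_name`, `entities.user_mentions`, `retweeted_status`, `in_reply_to_screen_name`):

```python
def build_comm_edgelist(tweets: list) -> set:
    """Directed (source, target) communication edges from mention, retweet,
    and reply relationships, keeping only users who produced content."""
    authors = {t["user"]["screen_name"] for t in tweets}
    edges = set()
    for t in tweets:
        src = t["user"]["screen_name"]
        for m in t.get("entities", {}).get("user_mentions", []):
            edges.add((src, m["screen_name"]))          # mention edge
        if "retweeted_status" in t:
            edges.add((src, t["retweeted_status"]["user"]["screen_name"]))  # retweet edge
        if t.get("in_reply_to_screen_name"):
            edges.add((src, t["in_reply_to_screen_name"]))  # reply edge
    # Keep only edges between users with content in the dataset; this also
    # removes the isolates that such filtering would otherwise leave behind.
    return {(s, g) for (s, g) in edges if s in authors and g in authors}
```

Here all three edge types are collapsed into a single directed link, matching the treatment used for most of the models; a per-modality edgelist would simply filter on the edge source shown in the comments.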
For each data/embedding combination, we calculated the k nearest neighbors for k ∈ {10, 50} and then measured the precision of the response, defined as the proportion of positive responses. After conducting this query with each positively labeled account as the seed node, we averaged the precision across all queries to compute a metric for the given data/embedding combination. Given that we calculated k nearest neighbors for k ∈ {10, 50}, our primary metrics were precision at k = 10 (p@10) and precision at k = 50 (p@50). Given that the labels are not ranked, we cannot leverage any rank-based metric, and precision at k is therefore appropriate. We can compare these percentages to the naive approach of random sampling, which is provided in Table 3. Any performance above these random values indicates model value.

In our evaluation, each query is an independent test, and is not meant to simulate an analyst recursively searching the entire network. In social cybersecurity applications, users generally use Bot-Match recursively two or three 'hops' from the seed node. We have never seen users try to traverse an entire network, given the unwieldy nature of this task. Because of this, our evaluation and the production Bot-Match methodology are not meant to estimate the total number of bots in a network (supervised machine learning models or label propagation models are better suited for this task). Without estimating the total number of bots in a stream, evaluating recall and F1 score is not appropriate.

The results of the embedding test are found in Table 3 and provide insights on both specific models and appropriate use cases for model types. The first observation is that all models provide significant value to the user when compared to naive baselines.
For example, in the case of the Yemen data, if a user queries with a single random string intimidation account, on average 5 out of 10 returned accounts will be random string intimidation accounts (and the remainder may be other types of intimidation accounts used in this attack). This provides real and tangible value to analysts. The second observation is that all models perform better on the IRA data than on the Yemen data. This is because the IRA data is smaller, with a higher density of "similar" accounts, and is more easily distinguished in both graph and semantic representation.
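The averaged precision-at-k metric described above can be sketched as follows; this minimal version assumes dense embedding vectors and cosine similarity, and is illustrative rather than the exact evaluation code:

```python
import numpy as np

def precision_at_k(embeddings: np.ndarray, labels: np.ndarray, k: int) -> float:
    """Average precision@k over all positively labeled seed nodes.
    Each seed queries its k nearest neighbors by cosine similarity
    (self-matches excluded); precision is the share of positives returned."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    np.fill_diagonal(sims, -np.inf)          # never return the seed itself
    seeds = np.flatnonzero(labels)
    precisions = []
    for s in seeds:
        topk = np.argsort(-sims[s])[:k]      # indices of the k nearest neighbors
        precisions.append(labels[topk].mean())
    return float(np.mean(precisions))
```

Averaging over every positive seed makes each query an independent test, matching the evaluation design rather than a recursive traversal.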
Table 3. Model Evaluation
                                         Yemen Data     IRA Data       CORA           Reddit
Type               Model                 p@10   p@50    p@10   p@50    p@10   p@50    p@10   p@50
                   Random Baseline       0.12   0.12    0.22   0.22    0.18   0.18    0.045  0.045
Content            Jaccard Similarity    0.421  0.387   …      …       …      …       …      …
Content            Doc2Vec               …      …       …      …       …      …       …      …
Network            Splitter              0.258  0.172   0.663  0.605   0.335  0.238   **     **
Network            role2vec              0.326  0.248   …      …       …      …       **     **
Network            SDNE                  0.396  0.303   0.701  0.61    0.452  0.376   **     **
Network            GCN (no features)     0.285  0.213   0.612  0.58    0.566  0.339   **     **
Network & Content  GCN with Features     …      …       …      …       …      …       …      …
Network & Content  BigGraph w/ Initial   0.356  0.250   …      …       …      …       …      …

* These models are not able to embed the provided data structure.
** These models were not able to embed large graphs on our compute resources.
Next we compare the embedding types. With the much more integrated conversation found in the Yemen data, the content models generally provide better precision across all values, while the more clustered IRA data shows almost equal performance from both content and network embeddings. The network + content embedding provides the best model for the Yemen data, and still outperforms most network models for the IRA data, albeit with BigGraph as opposed to GCN.

Focusing on the content algorithms, we see the classic models excel in similarity analysis. While these models will not necessarily create the contextual universal embeddings that Google USE and Facebook LASER were designed for, they still excel at the basic task of document similarity and document retrieval. In particular, the LDA model and Jaccard similarity perform exceptionally well and are inherently multilingual, being based on a bag of words or bag of tokens (though care must be taken when choosing the size of the term frequency matrix in the presence of many languages). We also observe that while Doc2Vec has high performance, it only marginally surpasses LDA. The strong performance of cosine/Jaccard similarity and LDA is largely a result of having substantial content for each user; the value of a bag of words representation increases with more content.

Focusing on graph embedding, we see strong performance by the random walk algorithms node2vec, role2vec, and SybilRank across all datasets. Graph factorization and deep learning showed some success, though this fades at higher values of k. The PyTorch BigGraph model scales much better than any other model, but did not perform as well as other models in our implementation. The GCN model without features and the Splitter model did not perform well in our evaluation.
SybilRank performs strongly across the data sets at lower values of k, though performance at higher k varied.

Combining graph embedding with node features produced strong but mixed results. While GCNs produced the highest performance on the Yemen data, they produced mediocre performance on the IRA data, with the reverse true for BigGraph (high results for IRA but less so for Yemen). This demonstrates that on this social data the GCN is getting more traction from the content features, as opposed to the BigGraph algorithm, which is primarily focused on network features. GraphSAGE demonstrated strong performance and scaled better than GCN. BigGraph scaled better than all other models.

Given these results we selected the Bot-Match algorithms for social media embedding in a social cybersecurity context. For our production Bot-Match algorithm, we offer LDA for content similarity, node2vec for network similarity, and GCN with features for network + content embedding. We also use cosine similarity and BigGraph implementations for larger networks (> K nodes).

In Figure 4 we provide t-Distributed Stochastic Neighbor Embedding (tSNE) visualizations of the selected models for both the Yemen and the IRA data. These visualizations provide more insight into model performance. We can visually see the higher precision of the IRA data over the Yemen data. We can also see various natural clusters emerging from the data, particularly the graph structure already visualized for the IRA data.

Fig. 4. t-Distributed Stochastic Neighbor Embedding (tSNE) 2-D visualization of embeddings for both the Yemen and IRA data. Panels: (a) Yemen LDA, (b) Yemen node2vec, (c) Yemen BigGraph, (d) Yemen GCN (w/ features), (e) IRA LDA, (f) IRA node2vec, (g) IRA BigGraph, (h) IRA GCN (w/ features).

In this section we compare the results in Table 3 for the Yemen data set with supervised machine learning methods, namely the Bot-Hunter model [6] and the Botometer model [16].
The Yemen data set was selected because it is in the data format consumed by these models. The IRA data, while derived from Twitter data, was not released in its original format and is therefore not compatible with the data parsers and feature engineering of these models.

In the Yemen data set we labeled 4,312 accounts as having 15-digit randomly generated alphanumeric screen names. These accounts were part of a larger bot intimidation attack conducted against a freelance journalist. In the fall of 2017, her followers jumped from approximately 20K to approximately 50K. Most of these were bots. At the time of our collection, we were able to get account histories on 35,763 of those that had not been suspended, removed, or made private.

As seen in Table 4, Bot-Hunter and Botometer both find approximately 90% of the 15-digit random alphanumeric screen name accounts (this is partially due to the fact that both models use screen name string entropy in their feature space). The table also displays some of the limitations of supervised learning. Botometer, which has been tuned for high precision, found only random string bots: 100% of the accounts found by Botometer were random string accounts, and it was not able to identify any of the 15,000 to 20,000 additional bots. Bot-Hunter, which was tuned for higher recall, did find many more non-random-string bots, but 28,406 accounts is still too many to manually evaluate for characteristics and attribution. In this case, as the user begins exploring influential bots or other suspicious accounts not found by these supervised models, Bot-Match allows them to quickly find similar accounts with content-based information retrieval.
Table 4. Results of Supervised Machine Learning Models
Model        Total Bots   Percentage of Random String Accounts Found
Bot-Hunter   28,406       86.5%
Botometer    4,312        89.6%
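The coverage figures in Table 4 reduce to simple set arithmetic over account identifiers; a minimal sketch (the function name is illustrative):

```python
def detector_stats(flagged: set, labeled: set) -> tuple:
    """Recall on the labeled random-string set and the share of the
    detector's flags that fall inside that labeled set."""
    found = flagged & labeled
    return len(found) / len(labeled), len(found) / len(flagged)
```

For Botometer, the second value would be 1.0 (every flagged account was a random-string account), while the first value corresponds to the percentage column in Table 4.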
Given the four models selected above, we wanted to conduct one additional visual validation of the Bot-Match methodology in general and these four models in particular. We had previously collected all Twitter content and connections associated with US Members of Congress or Congressional candidates for the 2018 US midterm elections. We decided to use this data to test our embeddings since it is easily labeled by both party (Republican, Democrat, Other) and by chamber (House or Senate). We wanted to test whether the Bot-Match methodology and the four models selected above could leverage the social media connections (friend connections) and the social media content to capture the complex political environment of the US bicameral legislature in Euclidean vector space.

The data was prepared in the same way as the Yemen and IRA data, with the notable difference that the graph was constructed with friend links as opposed to communication links. Only members of Congress or Congressional candidates were retained as nodes in the graph. We then used our primary models as discussed above to create content, network, and network + content embeddings. Finally, we visualized these embeddings in two dimensions using t-Distributed Stochastic Neighbor Embedding (tSNE). The visualization is provided in Figure 5, where red indicates Republican, blue indicates Democrat, and green indicates another party. Circles indicate House politicians, and triangles represent Senate politicians.

Fig. 5. t-Distributed Stochastic Neighbor Embedding (tSNE) 2-D visualization of US Congressional members and Congressional candidates for the 2018 US midterm elections. Panels: (a) LDA, (b) node2vec, (c) BigGraph, (d) GCN (w/ features). Red indicates Republican politicians, blue indicates Democrat, and green indicates Independent. Circle markers indicate House politicians/candidates, while triangles indicate Senate politicians/candidates.

From this visual validation, we see that all four models are able to capture similarity between politicians of specific parties, and within parties are generally able to separate members of the Senate from members of the House. All four embeddings are also able to identify specific factions within each of the parties.
This visual validation gives us confidence that the selected embeddings are able to capture the rich graph and semantic features and map them to Euclidean space in such a way that a k-nearest neighbors search provides value in finding similar accounts.

Bot-Match is an important tool in the social cybersecurity workflow. While social cybersecurity workflows vary between teams and specific problem sets, a typical workflow is enumerated below. When an information campaign is initiated or expected, social cybersecurity analysts begin developing a data collection strategy in order to collect the core data associated with the information campaign or world event. Often this collection is through either keyword filters or snowball sampling [22] of the network. Most large world events require iterative collection using both keyword filtering and snowball sampling. Once the data is collected, the team begins exploratory data analysis, which often consists of understanding the temporal and spatial density of the data as well as common hashtags and influential accounts. With exploratory analysis complete, analysts attempt to classify accounts and images. They often run bot/cyborg/troll detection as well as propaganda detection on the accounts. Bots are often used as force multipliers in an information campaign, and their presence can help outline the boundary of the campaign. Additionally, analysts may attempt to classify actor type (media, politician, celebrity, etc.). The classification stage can also include meme detection to extract all memes from the social media stream (memes help clearly identify messaging and target audience) [9]. By the time the team finishes classifying accounts and images, they have usually whittled the stream down to the data of interest, and they then begin a more intensive characterization of the accounts in this smaller data set.
Account characterization may be followed by campaign characterization, analysis of themes and narratives, and validation of campaign attribution (identification of the perpetrator). These generic social cybersecurity steps are summarized and enumerated below:
(1) Filter social media (keyword filter or snowball sampling)
(2) Exploratory data analysis (temporal/spatial distribution, common hashtags, influential accounts)
(3) Classify accounts, images, etc.
(4) Characterize accounts, images, etc.
(5) Characterize campaign
(6) Identify themes/messages/motives
(7) Identify target audience
(8) Attribution (identify perpetrator)
By the time exploratory data analysis, account/image classification, and account/image characterization are complete, the team has usually found a list of sophisticated accounts that are a core part of the information campaign. These core accounts of interest become the input for Bot-Match, allowing the team to find similar accounts in the campaign, building out the information campaign in an iterative fashion.

In this way the Bot-Match tool and methodology are designed to be used in tandem with supervised machine learning models. Supervised models such as Botometer [16] and Bot-Hunter [6] are able to find large concentrations of bots, triage the network, and provide macro-level bot statistics. However, many bots, often the most interesting and effective ones, remain undetected. This is because supervised models can be brittle and are biased by their training data toward specific bot types and genres [8, 58]. These accounts are often found through exploratory data analysis. Once found, Bot-Match allows the analyst to find other similar accounts that have also likely avoided detection. The interesting accounts that Bot-Match returns to the analyst become new seed nodes, resulting in a recursive search pattern that allows an analyst to rapidly uncover sophisticated information operations in a matter of hours. This method of query is more effective than the keyword boolean search traditionally offered in social analytics tools. A query with all information (content and connections) is more useful than a query with a single relevant hashtag.

The embedding type (content, network, or network + content) is primarily selected based on user requirements. If an analyst wants to find accounts that post similar content to a seed account, regardless of where they are in the network, then they should leverage semantic similarity.
If trying to find nodes that are proximate in the network structure, then network embedding is more appropriate. We found that embedding network + content with GCN (with features) is a good default model if computationally feasible.

The most attractive attribute of Bot-Match is its ability to adapt to any problem or search requirement without labeling and training a new supervised machine learning model. All that is needed is a seed node and a target data set to search. This provides tangible value to social cybersecurity analysts in particular and social media analysis in general.

In many ways the Bot-Match methodology provides a recommender system for social cybersecurity. Item-item recommendation systems (also known as collaborative filtering) recommend items based on similarity between items, often measured by user ratings of those items. If you are interested in a hammer, the recommendation system may recommend a hand saw based on item similarity. In our case, Bot-Match says that if you are interested in a certain account manipulating a target subculture, then you may also be interested in these additional k accounts that have similar connections and narratives. Selection of seed accounts could be done explicitly, or may be inferred through browsing and other search and exploratory actions.

In this section we illustrate the use of the Bot-Match algorithm and methodology in a social cybersecurity case study. In this case study, we focus on analyzing information operations and specific suspicious accounts in the 2019 Canadian national elections, which were held on 21 October 2019. The formal campaign started on 11 September 2019, with a total campaign duration of 40 days.

Given the documented foreign influence in the 2016 US elections [18, 39], Canadian authorities took extra precautions to prevent similar tampering in their national election.
The biggest policy they implemented was requiring all companies that host political advertising to set up a public-facing registry with the specific ad and the name of the person who authorized it. Many companies (Reddit, Google, Microsoft, and others) decided to ban political advertising altogether, while others (Facebook, Instagram, CBC.ca) began setting up registries [48]. This policy, while helpful in stopping the manipulative paid content that the IRA leveraged in the 2016 US election, does not stop manipulation by accounts that produce content that is not promoted through advertisement funding. This meant that bot, troll, cyborg, and sock-puppet accounts were still able to manipulate the conversation. Our goal was to find and analyze these malicious accounts.

To collect data associated with the 2019 Canadian national election, our team used Twitter's streaming API to filter all tweets that contained hashtags associated with this event. We did this by starting with a few general hashtags associated with the election.
Fig. 6. Temporal distribution of Canada 2019 national election related tweets collected with the Twitter streaming API. The density of accounts with "bot-like" attributes as predicted by the Bot-hunter tool [6] is shown in red; an annotation marks the release of the Trudeau 'blackface' pictures.
Having collected the data, we built two embeddings: one focused on node embedding of the graph, and the other focused on content embedding of the tweets. The scale of this data collection meant that some of the embedding techniques we explored were computationally difficult or impossible as implemented above. Given this, we used the PyTorch BigGraph model to embed the graph and a Latent Dirichlet Allocation (LDA) model to embed the content.

The PyTorch BigGraph model was used to embed the communication network created by directed links associated with the communication modes in Twitter (mention, retweet, reply). We found the BigGraph model more computationally tractable (20 minutes vs. 2 days for LDA), but LDA provided more meaningful nearest neighbor relationships (the BigGraph embeddings produced too much noise in the returned nearest neighbors).

Using the LDA model, we applied the Bot-Match methodology to find the 10 nearest neighbors of two sophisticated bot accounts that were manipulating Canadian political discussion on Twitter. One of the bot accounts was manipulating the political right (Canadian Conservative Party) and the other the political left (Canadian Liberal Party). While this paper does not provide identifying information for the accounts, general descriptive information for both queries is provided in Table 5. This table includes general information associated with the accounts (number of tweets, number of followers, etc.), as well as bot prediction probabilities from two production supervised detection algorithms, Bot-hunter [6] and Botometer [16]. It also includes top hashtags of the query accounts and nearest neighbors to evaluate semantic correlation.

From Table 5 we see that all nearest neighbors of both queries are clearly associated with the stance and narrative of the query account, as noted in the top hashtags.
We also see that many of the accounts appear to have some automated activity (are bot or cyborg accounts), as indicated by high volume and high retweet percentages. We also notice that many of these accounts were not detected by the state-of-the-art production bot detection algorithms. The discrepancies between the two models, seen particularly in the second query, are likely due to the very different bot genres used for training data.

As the analyst explores these accounts, additional accounts of interest may surface, creating new Bot-Match queries, which results in a recursive nearest neighbors search of accounts of interest. This recursive nearest neighbor search of graph and semantic embeddings provides an important tool for social cybersecurity practitioners.
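The recursive search pattern described above can be sketched as follows; the function and its parameters are illustrative, assuming dense embedding vectors and cosine similarity, with each returned account becoming a new seed for a small number of hops:

```python
import numpy as np
from collections import deque

def bot_match(embeddings: np.ndarray, seed_ids: list, k: int = 10, hops: int = 2) -> set:
    """Recursive nearest-neighbor search: query each seed for its k nearest
    neighbors by cosine similarity, add the results as new seeds, and repeat
    for `hops` levels. Returns the candidate accounts (seed indices excluded)."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    found = set(seed_ids)
    frontier = deque((s, 0) for s in seed_ids)
    while frontier:
        node, depth = frontier.popleft()
        if depth >= hops:
            continue
        sims = normed @ normed[node]
        sims[node] = -np.inf                     # never return the query itself
        for nbr in np.argsort(-sims)[:k]:        # k most similar accounts
            if int(nbr) not in found:
                found.add(int(nbr))
                frontier.append((int(nbr), depth + 1))
    return found - set(seed_ids)
```

Limiting `hops` to two or three mirrors the analyst behavior noted earlier: the goal is to build out a campaign around a seed, not to traverse the entire network.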
Table 5. Descriptive Results of Bot-Match Queries of Sophisticated Bots Manipulating 2019 Canadian Political Parties.
Nearest Neighbors Query with Sophisticated Conservative Bot

Screen Name  Tweets   Followers  Friends  Bot-hunter  Botometer  RT %   Top Hashtags
Query        39,396   11,157     4,616    0.690       0.148      0.636  TrudeauMustGo, cdnpoli, elxn43, DefundCBC, CPC
Neighbor 1   41,212   3,635      4,992    0.457       0.691      0.678  TrudeauMustGo, Trudeau, Canada, cdnpoli, Canadians
Neighbor 2   7,886    935        1,715    0.378       0.148      0.592  TrudeauMustGo, cdnpoli, elxn43, LiberalsMustGo, SayNoToGlobalism
Neighbor 3   7,828    155        468      0.280       0.103      0.737  TrudeauMustGo, cdnpoli, elxn43, blackface, BREAKING
Neighbor 4   12,632   460        423      0.370       0.071      0.633  elxn43, TrudeauMustGo, cdnpoli, TrudeauWorstPM, TrudeauBlackface
Neighbor 5   5,939    385        256      0.398       0.129      0.546  TrudeauMustGo, cdnpoli, elxn43, LiberalsMustGo, ButtsMustGo
Neighbor 6   686      33         117      0.135       0.103      0.580  TrudeauMustGo, elxn43, cdnpoli, brownface, NotAsAdvertised
Neighbor 7   9,562    406        858      0.479       0.083      0.611  TrudeauMustGo, elxn43, cdnpoli, elxn2019, Scheer4PM
Neighbor 8   22,057   319        538      0.339       0.096      0.605  cdnpoli, TrudeauMustGo, LiberalsMustGo, elxn43, FakeNewsMedia
Neighbor 9   11       25         8        0.477       0.969      0.705  cdnpoli, elxn43, TrudeauMustGo, chooseforward

Nearest Neighbors Query with Sophisticated Liberal Bot

Screen Name  Tweets   Followers  Friends  Bot-hunter  Botometer  RT %   Top Hashtags
Query        264,783  16,503     18,098   0.470       0.355      0.753  cdnpoli, elxn43, ChooseForward, topoli, onpoli
Neighbor 1   11,791   394        160      0.335       0.071      0.816  cdnpoli, elxn43, ChooseForward, NeverScheer, TeamTrudeau
Neighbor 2   96,401   624        1,615    0.605       0.111      0.809  cdnpoli, elxn43, BREAKING, Scheer, Trudeau
Neighbor 3   10,432   206        1,528    0.740       0.083      0.842  cdnpoli, elxn43, ChooseForward, Elxn43, CDNpoli
Neighbor 4   35,557   735        379      0.560       0.071      0.870  cdnpoli, elxn43, ChooseForward, Trudeau, CPC
Neighbor 5   22,937   226        983      0.668       0.138      0.848  cdnpoli, elxn43, ChooseForward, ChooseForwardWithTrudeau, IStandWithTrudeau
Neighbor 6   27,377   1,041      557      0.592       0.096      0.878  cdnpoli, elxn43, ChooseForward, TeamTrudeau, IStandWithTrudeau
Neighbor 7   5,026    438        1,540    0.558       0.103      0.843  cdnpoli, elxn43, ChooseForward, ScheerWasSoPoorThat, IStandWithTrudeau
Neighbor 8   3,445    334        1,026    0.564       0.066      0.768  cdnpoli, elxn43, ChooseForward, NeverScheer, YankeeDoodleAndy
Neighbor 9   37,845   257        387      0.435       0.222      0.927  cdnpoli, elxn43, ChooseForward, CPC, onpoli
As discussed above, we have already deployed prototype models of Bot-Match to monitor malicious disinformation in the Canadian elections. Having discovered sophisticated bot accounts manipulating both the political left and political right in Canada, we used Bot-Match to build out these campaigns and delineate the respective manipulative (dis)information operations.
The concept of using an account and all associated features and connections as a query has many applications beyond social cybersecurity, including retail, link prediction, intelligence, and information retrieval.

The retail business was one of the first adopters of recommender systems, and is arguably the most mature at deploying scalable collaborative filtering. These systems are inherently constrained by the user and suffer from cold-start challenges. All product recommendations for a user are limited by the user's own biases and blind spots, some of which they would like to circumvent. Using the concept of an account query, an online retailer could allow a user to receive recommendations based on someone else's account (a celebrity, a friend, or someone else whose tastes they admire and wish to emulate). Social recommendations are some of the most powerful product recommendations, and allowing a person to receive recommendations as if they were someone else could be profitable for the retail industry. This approach would have a number of privacy hurdles to overcome, but if implemented correctly, allowing a customer to "shop as if they were ..." could be the next big step in online retail.

Social media companies use recommendation systems and link prediction algorithms to make recommendations to a user based on their interests, content, and existing links. As with retail, some users may like to see recommendations as if they were someone else. For example, a young professional might like to see the recommendations presented to an established professional in their field whom they follow and attempt to emulate. Once again, these raise significant but not insurmountable privacy concerns.

The final application area is intelligence. Many systems used by the intelligence community use simple boolean search patterns to search repositories of unstructured data.
While keyword searches may be required for initial exploration, as analysts find entities they are interested in, these entities and all content and connections associated with them could be used as a search query. This type of rich query could provide better search results for intelligence analysts.
This paper evaluates state-of-the-art graph and semantic embeddings for social media data, and then leverages these embeddings for bot detection to enable social cybersecurity. Bot-Match is evaluated on two new social cybersecurity datasets, validated on a third dataset associated with US politics, and then demonstrated on a fourth dataset associated with the 2019 Canadian national elections. Within the emerging discipline of social cybersecurity, the Bot-Match paradigm provides a novel way for analysts to find similar nefarious actors and recursively discover a complex disinformation operation without labeling and training a supervised machine learning model. As such, it extends the concept of content-based information retrieval beyond multimedia. Finally, while used within the social cybersecurity context, this approach has broad application to retrieval tasks characterized by network connections and semantic content, including document retrieval, recommendation systems, social recommendation, and other use cases.
ACKNOWLEDGMENTS
This work was supported in part by the Office of Naval Research (ONR) Multidisciplinary University Research Initiative Award N000140811186, Award N000141812108, ONR Award N00014182106, and the Center for Computational Analysis of Social and Organizational Systems (CASOS). The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the ONR or the U.S. Government.
REFERENCES [1] Nesreen K Ahmed, Ryan Rossi, John Boaz Lee, Theodore L Willke, Rong Zhou, Xiangnan Kong, and Hoda Eldardiry.2018. Learning Role-based Graph Embeddings. arXiv preprint arXiv:1802.02896 (2018).[2] Seyed Ali Alhosseini, Raad Bin Tareaf, Pejman Najafi, and Christoph Meinel. 2019. Detect Me If You Can: SpamBot Detection Using Inductive Representation Learning. In
Companion Proceedings of The 2019 World Wide WebConference . ACM, 148–153.[3] Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2018. A Robust Self-learning Method for Fully UnsupervisedCross-lingual Mappings of Word Embeddings. arXiv preprint arXiv:1805.06297 (2018).[4] Mikel Artetxe and Holger Schwenk. 2018. Massively Multilingual Sentence Embeddings for Zero-shot Cross-lingualTransfer and Beyond. arXiv preprint arXiv:1812.10464 (2018).[5] Nicola Barbieri, Francesco Bonchi, and Giuseppe Manco. 2014. Who to follow and why: link prediction with explanations.In
Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining . 1266–1275. [6] David M Beskow and Kathleen M Carley. 2018. Bot Conversations Are Different: Leveraging Network Metrics for BotDetection in Twitter. In . IEEE, 825–832.[7] David M Beskow and Kathleen M Carley. 2020. Characterization and Comparison of Russian and Chinese DisinformationCampaigns. In
Fake News, Disinformation, and Misinformation in Social Media , Kai Shu, Suhang Wang, Dongwon Lee,and Huan Liu (Eds.). Springer, New York.[8] David M Beskow and Kathleen M Carley. 2020. You are Known by Your Friends: Leveraging Network Metrics for BotDetection in Twitter. In
Open Source Intelligence and Security Informatics, Mohammad Tayebi, Uwe Glässer, and David Skillicorn (Eds.). Springer, New York.
[9] David M. Beskow, Sumeet Kumar, and Kathleen M. Carley. 2020. The evolution of political memes: Detecting and characterizing internet memes with multi-modal deep learning. Information Processing & Management 57, 2 (March 2020), 102170. https://doi.org/10.1016/j.ipm.2019.102170
[10] David M Blei, Andrew Y Ng, and Michael I Jordan. 2003. Latent Dirichlet Allocation. Journal of Machine Learning Research 3, Jan (2003), 993–1022.
[11] Mohamed Reda Bouadjenek, Hakim Hacid, and Mokrane Bouzeghoub. 2016. Social networks and information retrieval, how are they converging? A survey, a taxonomy and an analysis of social information retrieval approaches and platforms. Information Systems 56 (2016), 1–18.
[12] Qiang Cao, Michael Sirivianos, Xiaowei Yang, and Tiago Pregueiro. 2012. Aiding the detection of fake accounts in large scale social online services. In Presented as part of the 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI 12). 197–210.
[13] Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, et al. 2018. Universal Sentence Encoder. arXiv preprint arXiv:1803.11175 (2018).
[14] Nikan Chavoshi, Hossein Hamooni, and Abdullah Mueen. 2016. DeBot: Twitter Bot Detection via Warped Correlation. In
ICDM. 817–822.
[15] Ali Choumane. 2014. A semantic similarity-based social information retrieval model. Social Network Analysis and Mining 4, 1 (2014), 175.
[16] Clayton Allen Davis, Onur Varol, Emilio Ferrara, Alessandro Flammini, and Filippo Menczer. 2016. BotOrNot: A system to evaluate social bots. In Proceedings of the 25th International Conference Companion on World Wide Web. International World Wide Web Conferences Steering Committee, 273–274.
[17] Scott Deerwester, Susan T Dumais, George W Furnas, Thomas K Landauer, and Richard Harshman. 1990. Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science 41, 6 (1990), 391–407.
[18] Renee DiResta, Kris Shaffer, Becky Ruppel, David Sullivan, Robert Matney, Ryan Fox, Jonathan Albright, and Ben Johnson. 2018. The Tactics & Tropes of the Internet Research Agency. New Knowledge.
[19] Alessandro Epasto and Bryan Perozzi. 2019. Is a Single Embedding Enough? Learning Node Representations That Capture Multiple Social Contexts. In The World Wide Web Conference. ACM, 394–404.
[20] Dion Goh and Schubert Foo. 2008. Social information retrieval systems: Emerging technologies and applications for searching the web effectively. Information Science Reference, Hershey, PA.
[21] Neil Zhenqiang Gong, Mario Frank, and Prateek Mittal. 2014. SybilBelief: A semi-supervised learning approach for structure-based sybil detection. IEEE Transactions on Information Forensics and Security 9, 6 (2014), 976–987.
[22] Leo A Goodman. 1961. Snowball sampling. The Annals of Mathematical Statistics (1961), 148–170.
[23] Palash Goyal and Emilio Ferrara. 2018. Graph Embedding Techniques, Applications, and Performance: A Survey. Knowledge-Based Systems 151 (2018), 78–94.
[24] Aditya Grover and Jure Leskovec. 2016. Node2vec: Scalable Feature Learning for Networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 855–864.
[25] Will Hamilton, Zhitao Ying, and Jure Leskovec. 2017. Inductive representation learning on large graphs. In
Advances in Neural Information Processing Systems. 1024–1034.
[26] Zellig S Harris. 1951. Methods in Structural Linguistics. (1951).
[27] Jinyuan Jia, Binghui Wang, and Neil Zhenqiang Gong. 2017. Random walk based fake account detection in online social networks. In 2017 47th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN). IEEE, 273–284.
[28] Thomas N Kipf and Max Welling. 2016. Semi-supervised Classification with Graph Convolutional Networks. arXiv preprint arXiv:1609.02907 (2016).
[29] Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-thought Vectors. In
Advances in Neural Information Processing Systems. 3294–3302.
[30] Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In International Conference on Machine Learning. 1188–1196.
[31] Adam Lerer, Ledell Wu, Jiajun Shen, Timothee Lacroix, Luca Wehrstedt, Abhijit Bose, and Alex Peysakhovich. 2019. PyTorch-BigGraph: A Large-scale Graph Embedding System. arXiv preprint arXiv:1903.12287 (2019).
[32] Michael S Lew, Nicu Sebe, Chabane Djeraba, and Ramesh Jain. 2006. Content-based multimedia information retrieval: State of the art and challenges.
ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 2, 1 (2006), 1–19.
[33] Qimai Li, Zhichao Han, and Xiao-Ming Wu. 2018. Deeper insights into graph convolutional networks for semi-supervised learning. In Thirty-Second AAAI Conference on Artificial Intelligence.
[34] Al Bawaba The Loop. 2017. Thousands of Twitter Bots Are Attempting to Silence Reporting on Yemen. (2017).
[35] Jiaheng Lu, Ying Lu, and Gao Cong. 2011. Reverse spatial and textual k nearest neighbor search. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data. 349–360.
[36] Thomas Magelinski, David Beskow, and Kathleen M Carley. 2020. Graph-Hist: Graph Classification from Latent Feature Histograms With Application to Bot Detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 34–44.
[37] Miller McPherson, Lynn Smith-Lovin, and James M Cook. 2001. Birds of a feather: Homophily in social networks. Annual Review of Sociology 27, 1 (2001), 415–444.
[38] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space. arXiv preprint arXiv:1301.3781 (2013).
[39] Robert S Mueller. 2019. Report on the investigation into Russian interference in the 2016 presidential election. US Dept. of Justice, Washington, DC (2019).
[40] Mingdong Ou, Peng Cui, Jian Pei, Ziwei Zhang, and Wenwu Zhu. 2016. Asymmetric Transitivity Preserving Graph Embedding. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 1105–1114.
[41] Matteo Pagliardini, Prakhar Gupta, and Martin Jaggi. 2017. Unsupervised Learning of Sentence Embeddings Using Compositional N-gram Features. arXiv preprint arXiv:1703.02507 (2017).
[42] Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 1532–1543.
[43] Sebastian Ruder, Ivan Vulić, and Anders Søgaard. 2019. A Survey of Cross-lingual Word Embedding Models. Journal of Artificial Intelligence Research 65 (2019), 569–631.
[44] Badrul Munir Sarwar, George Karypis, Joseph A Konstan, John Riedl, et al. 2001. Item-based Collaborative Filtering Recommendation Algorithms.
In WWW.
[46] Nino Shervashidze, Pascal Schweitzer, Erik Jan van Leeuwen, Kurt Mehlhorn, and Karsten M Borgwardt. 2011. Weisfeiler-Lehman Graph Kernels. Journal of Machine Learning Research 12, Sep (2011), 2539–2561.
[47] Kate Starbird, Grace Muzny, and Leysia Palen. 2012. Learning from the crowd: collaborative filtering techniques for identifying on-the-ground Twitterers during mass disruptions. In
Proceedings of the 9th International Conference on Information Systems for Crisis Response and Management (ISCRAM).
Journal of the Association for Information Science and Technology 68, 9 (2017), 2264–2266.
[50] Eleni Triantafillou, Jamie Ryan Kiros, Raquel Urtasun, and Richard Zemel. 2016. Towards Generalizable Sentence Embeddings. In
Proceedings of the 1st Workshop on Representation Learning for NLP. 239–248.
[51] William Thomas Tutte. 1960. Convex Representations of Graphs. Proceedings of the London Mathematical Society 3, 1 (1960), 304–320.
[52] Twitter. 2018. Twitter Election Integrity Data Archive. https://about.twitter.com/en_us/values/elections-integrity.html. Accessed: 2018-11-30.
[53] Binghui Wang, Neil Zhenqiang Gong, and Hao Fu. 2017. GANG: Detecting fraudulent users in online social networks via guilt-by-association on directed graphs. In 2017 IEEE International Conference on Data Mining (ICDM). IEEE, 465–474.
[54] Daixin Wang, Peng Cui, and Wenwu Zhu. 2016. Structural Deep Network Embedding. In
Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 1225–1234.
[55] Li Weng, Laurent Amsaleg, April Morton, and Stéphane Marchand-Maillet. 2014. A privacy-preserving framework for large-scale content-based information retrieval. IEEE Transactions on Information Forensics and Security 10, 1 (2014), 152–167.
[56] Le Wu, Peijie Sun, Richang Hong, Yanjie Fu, Xiting Wang, and Meng Wang. 2018. SocialGCN: An efficient graph convolutional network based model for social recommendation. arXiv preprint arXiv:1811.02815 (2018).
[57] Cao Xiao, David Mandell Freeman, and Theodore Hwa. 2015. Detecting Clusters of Fake Accounts in Online Social Networks. In Proceedings of the 8th ACM Workshop on Artificial Intelligence and Security. ACM, 91–101.
[58] Kai-Cheng Yang, Onur Varol, Pik-Mai Hui, and Filippo Menczer. 2019. Scalable and Generalizable Social Bot Detection through Data Selection. arXiv preprint arXiv:1911.09179 (2019).
[59] Lingpeng Yang, Donghong Ji, Guodong Zhou, Yu Nie, and Guozheng Xiao. 2006. Document re-ranking using cluster validation and label propagation. In Proceedings of the 15th ACM International Conference on Information and Knowledge Management. 690–697.
[60] Haifeng Yu, Michael Kaminsky, Phillip B Gibbons, and Abraham Flaxman. 2006. SybilGuard: defending against sybil attacks via social networks. In Proceedings of the 2006 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications. 267–278.
[61] Tian-wu Zhang, Wei-ping Li, Lu Wang, and Jie Yang. 2020. Social recommendation algorithm based on stochastic gradient matrix decomposition in social network.