Graph embeddings for Abusive Language Detection
Noé Cécillon, Vincent Labatut, Richard Dufour, Georges Linarès
Abstract
Abusive behaviors are common on online social networks. The increasing frequency of anti-social behaviors forces the hosts of online platforms to find new solutions to address this problem. Automating the moderation process has thus received a lot of interest in the past few years. Various methods have been proposed, most based on the exchanged content, and one relying on the structure and dynamics of the conversation. The latter has the advantage of being language-independent; however, it leverages a hand-crafted set of topological measures which are computationally expensive and not necessarily suitable to all situations. In the present paper, we propose to use recent graph embedding approaches to automatically learn representations of conversational graphs depicting message exchanges. We compare two categories: node vs. whole-graph embeddings. We experiment with a total of 8 approaches and apply them to a dataset of online messages. We also study more precisely which aspects of the graph structure are leveraged by each approach. Our study shows that the representation produced by certain embeddings captures the information conveyed by specific topological measures, but misses out on other aspects.
Keywords
Graph embedding · Automatic abuse detection · Conversational graph · Online conversations · Social networks
Noé Cécillon: Laboratoire Informatique d'Avignon – LIA EA 4128, Avignon Université, France. E-mail: [email protected]
Vincent Labatut: Laboratoire Informatique d'Avignon – LIA EA 4128, Avignon Université, France. E-mail: [email protected]
Richard Dufour: Laboratoire Informatique d'Avignon – LIA EA 4128, Avignon Université, France. E-mail: [email protected]
Georges Linarès: Laboratoire Informatique d'Avignon – LIA EA 4128, Avignon Université, France. E-mail: [email protected]
1 Introduction

In recent years, online social media have allowed people to meet and discuss worldwide. These popular platforms attract more and more users, and are confronted with an increasing number of abusive behaviors. This phenomenon started to draw attention from governments, which request companies to moderate their social media platforms. Depending on the size of the communities to be administered, this can be an expensive process, since moderation is currently mainly done by human operators. Moreover, this task is difficult, especially because the definition of what constitutes an abuse is ambiguous and can vary depending on the context (e.g. media platform, community, and/or country). In order to automate the detection of abusive content in such social media, researchers have proposed methods primarily based on Natural Language Processing (NLP) approaches. These works rely on the textual content of the exchanged messages to detect specific types of abuse, such as offensive language [9,28,44,29], hate speech [14,37,42,2], racism [43] and cyber-bullying [13]. However, a major limitation of such NLP-based approaches is their sensitivity to intentional text obfuscation performed by malicious users to fool automatic systems. For instance, sh1t and f*ck are easily understandable by humans but difficult to detect by an algorithm if the training corpus does not reflect such situations. Therefore, NLP statistical systems cannot reflect all the forms of language that these abuses can take, due to the wide variety of language registers (which can range from colloquial to formal), the language proficiency of the contributors, and even the particular vocabulary inherent to the concerned community of users.

To address this language limitation and dependency, authors have proposed to incorporate behavioral information about users and the structure of conversations [4,7,12,25,46] as a way to improve the efficiency of language-based approaches.
In a previous work [31], we proposed to leverage conversational graph-based features to detect abusive messages in chat logs extracted from an online game. Such conversational graphs model interactions between users (i.e. who is arguing with whom?), completely ignoring the language content of the messages. We characterized the structure of these graphs by computing a large set of manually selected topological measures, and used them as features to train a classifier to detect abusive messages. As we did not know in advance which topological measures are the most discriminative for this task, we had to consider a very large set, and performed feature selection in order to identify the most relevant ones. This constitutes an important limitation of this method, and more generally of such feature engineering approaches. One major drawback is the long run time caused by the large number of measures to compute. Furthermore, since the set of measures has to be manually constituted by humans, it could be non-exhaustive, missing relevant features for the task at hand, or on the contrary include a lot of redundant information. It is even possible that no measure defined in the literature captures the information relevant to the task at hand.
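To give an intuition of what such hand-crafted features look like, here is a minimal hypothetical sketch (not the authors' implementation; the toy graph, the chosen measures and all names are illustrative) computing a few topological measures on a small directed conversational graph:

```python
# Minimal sketch of the feature-engineering approach: compute a few
# hand-picked topological measures on a toy directed conversational
# graph and assemble them into a feature vector.
# Adjacency is stored as {source: {target: weight}}.

def density(adj):
    """Density of a directed graph: edges / (n * (n - 1))."""
    nodes = set(adj) | {t for targets in adj.values() for t in targets}
    n = len(nodes)
    m = sum(len(targets) for targets in adj.values())
    return m / (n * (n - 1)) if n > 1 else 0.0

def reciprocity(adj):
    """Fraction of directed edges whose reverse edge also exists."""
    edges = {(s, t) for s, targets in adj.items() for t in targets}
    if not edges:
        return 0.0
    mutual = sum(1 for (s, t) in edges if (t, s) in edges)
    return mutual / len(edges)

def out_degree(adj, node):
    return len(adj.get(node, {}))

# Toy conversational graph with 3 participants.
adj = {"u1": {"u2": 2.0}, "u2": {"u1": 1.0, "u3": 1.0}, "u3": {}}

# Feature vector mixing graph-scale measures with a node-scale measure
# taken for the author of the targeted message ("u1" here).
features = [density(adj), reciprocity(adj), out_degree(adj, "u1")]
```

The actual feature set in [31] contains hundreds of such measures, which is precisely the computational burden motivating the embedding-based alternative.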
Graph embedding methods automate this graph representation process. They allow representing graphs as low-dimensional vectors while preserving at least a part of their topological properties. On the one hand, these representations are automatically learned, so they do not require any feature selection, and they are much more time-efficient than the approaches described above. On the other hand, unlike standard topological measures, the obtained representations are not directly interpretable in terms of graph structure. It is therefore not straightforward to understand exactly which information is captured by the embedding, and is possibly relevant to the application. Moreover, different embedding methods are assumed by construction to capture different aspects of the graph structure, but it is difficult to compare them directly, for the same reason. One way to assess the appropriateness of an embedding method for a task, and to compare several embedding methods through it, is to do so empirically.

In [16], Goyal & Ferrara propose such an experimental work. They compare five methods on tasks such as unobserved link prediction and node classification. They conduct their experiments on various types of networks (e.g. social relationships, user networks, collaboration networks). Nonetheless, only node-level methods are tested and, as stated in [6], the performance of graph embedding methods is very task-dependent. Therefore, the most effective methods in [16] might not be as appropriate for the task we focus on in this work. Mishra et al. [25] propose to profile authors in order to enhance the detection of abusive content online. They construct a community graph representing all authors and their connections, and use a node embedding method to obtain a vector representation of each user, called a user profile.
This method shows promising results when combined with standard abuse detection methods relying exclusively on the textual content. It is however limited to the use of a single embedding method.

In this work, we adopt an approach similar to Goyal & Ferrara and apply it to our abuse detection task. We leverage the (already mentioned) framework presented in our previous work [31], which is able to classify messages depending only on the structure of the conversation surrounding them. Text is not used in the process, only conversational networks, which makes it language-independent. On this basis, our first contribution is to study the effectiveness of graph embeddings in the context of online abuse detection. We assess and compare 8 methods designed to operate at different scales of the graph (node and whole-graph), and to preserve different structural properties. Our second contribution is an analysis of our results aiming at better understanding which structural properties of the graph are well captured by the considered embedding methods.

The rest of this article is organized as follows. First, in Section 2, we review the literature related to node and whole-graph embedding methods. Then, we present our task in Section 3, as well as the baseline that we previously developed and the embedding methods that we use in our experiments. In Section 4, we describe our dataset, our experimental protocol and settings, and the results that we obtain, and we discuss the topological properties preserved by each considered embedding method. Finally, in Section 5, we summarize our main findings and present some perspectives.
Generally speaking, the expression graph embedding refers to a family of methods aiming to represent graphs, or parts of graphs, in a low-dimensional space in which at least certain aspects of their structure are preserved [6]. By construction, objects which are similar with respect to these aspects have close vector representations in the embedding space [45]. In addition to the plain structure, certain methods are able
to capture additional information such as node labels or the weight and direction of edges.

One can distinguish four main categories of graph embedding methods, depending on the nature of the considered objects: node embedding, edge embedding, subgraph embedding and whole-graph embedding. Each category better fits the needs of different applications and problems. In this work, our task can be formulated as a node and/or graph classification problem, hence in the rest of this paper, we focus exclusively on these two types of embeddings. The rest of this section is a review of the main node and whole-graph embedding methods. Table 1 summarizes these methods and shows their main characteristics.

2.1 Node Embedding

Node embedding is the most common form of graph embedding in the literature. Such methods take a graph as input, for instance as an adjacency matrix, and output a vector of fixed dimension for each node in the graph. Following the taxonomy proposed in [16], we distinguish three categories, depending on the general approach used to perform the transformation: Matrix Factorization, Neural Networks and Random Walks. Note that the latter also uses neural networks, but introduces a different strategy to sample the graph.
Matrix Factorization
There are various ways to represent a graph in matrix form, such as the adjacency, Laplacian or transition matrices. The pioneering studies on node embedding propose to map nodes into low-dimensional vectors by decomposing such matrices into products of smaller matrices of the desired dimension, a process called
Matrix Factorization (MF).

The most straightforward approach is to leverage existing dimensionality reduction techniques, originally designed for tabular data, and apply them to a graph matrix. Doing so with the
Locally Linear Embedding (LLE) method proposed by Roweis & Saul [35] amounts to considering that every node in the graph is a weighted linear combination of its neighbors. The method first estimates the weights that best reconstruct the original characteristics of a node from its neighbors, and then uses these weights to generate vector representations. This method has been used in the literature to perform face recognition [45]. Belkin & Niyogi propose
Laplacian Eigenmaps (LE) [5], a method aiming at keeping strongly connected nodes close in the embedding space. Representations are obtained by computing the eigenvectors of the graph Laplacian. Typical applications of this method include node classification and link prediction [5,16]. A major drawback of these two methods is their high time complexity, which makes them poorly scalable and impractical for very large real-world graphs. Ahmed et al. propose a method called
Graph Factorization (GF) [1], which is much more time-efficient and can handle graphs with several hundred million nodes. GF uses stochastic gradient descent to optimize the matrix factorization. To improve its scalability, GF uses some approximation strategies, which can introduce noise in the generated representations. Furthermore, GF focuses on preserving only the first-order proximity, i.e. nodes which are directly connected have close representations. Hence, the global graph structure is not necessarily well preserved by this method. Ahmed et al. use this method to partition graphs and to predict the volume of e-mail exchanges between pairs of users [1]. Ou et al. introduce an MF method called
High-Order Proximity preserved Embedding (HOPE) [30], which factorizes a node similarity matrix. This similarity matrix is obtained using centrality measures such as Rooted PageRank, the Katz measure and the Adamic-Adar score. HOPE is specifically designed to preserve asymmetric transitivity in directed graphs. To this end, two vector representations are learned for each node: a source vector and a target vector. Applications of this method include link prediction, proximity approximation and vertex recommendation [23]. However, once again, the time complexity of this MF method is high and does not allow the processing of very large graphs. Li et al. present
BoostNE [22], a multi-level graph embedding framework that learns multiple graph representations at different granularity levels. Inspired by boosting, it is built on the assumption that multiple weak embeddings can lead to a stronger and more effective one. It applies an iterative process to a closed-form node connectivity matrix. This process successively factorizes the residual obtained from the previous factorization, to generate increasingly finer representations. The sequence of representations produced is then assembled to create the final embedding. Li et al. apply their method to a multi-label node classification task.
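To illustrate the general idea shared by this family of methods, the following sketch (purely illustrative, not any of the specific methods above) embeds nodes through a rank-d truncated singular value decomposition of the adjacency matrix, so that the product of the factors approximates the original matrix:

```python
import numpy as np

# Illustrative matrix-factorization embedding: decompose the adjacency
# matrix A with an SVD and keep the d leading components, yielding one
# d-dimensional vector per node.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

d = 2  # embedding dimension
U, S, Vt = np.linalg.svd(A)
# One node representation per row, scaled by the singular values.
embedding = U[:, :d] * S[:d]
print(embedding.shape)  # prints (4, 2)
```

Real MF methods differ mainly in which matrix they factorize (adjacency, Laplacian, higher-order similarity) and in how they compute the factorization at scale.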
Neural Networks
Neural approaches have been successfully adapted to many fields, including graph embedding. Wang et al. propose the
Structural Deep Network Embedding (SDNE) framework [40]. This method learns representations based on first- and second-order proximities in the graph. These two properties are jointly optimized using a deep autoencoder and a variant of Laplacian Eigenmaps, applying a penalty when similar nodes are mapped far from each other in the embedding space. This allows a good representation of both the local and global structure of the graph. This method has been used on tasks similar to the embedding method LE, i.e. node classification and link prediction [16,40]. Kipf & Welling develop a method called
Graph Convolutional Networks (GCN) [19]. It uses an iterative process wherein each iteration captures the local neighborhood of nodes, and repeating it allows capturing their global neighborhood. At each iteration, the process aggregates the representations of neighboring nodes, and uses a function of the obtained representation and of the embedding at the previous iteration to generate the new representation. Kipf & Welling leverage their method to perform document and entity classification.
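The aggregation step at the heart of a GCN layer can be sketched as follows (a toy illustration with untrained random weights, not Kipf & Welling's implementation): each layer computes a symmetrically normalized neighborhood average of the node features before applying a learned linear map and a nonlinearity.

```python
import numpy as np

# One GCN layer, sketched: H' = ReLU(D^{-1/2} (A + I) D^{-1/2} H W),
# where A is the adjacency matrix, I adds self-loops, D is the degree
# matrix of A + I, H holds the node features and W the trainable weights.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
A_hat = A + np.eye(3)                        # add self-loops
D_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt     # symmetric normalization

H = np.eye(3)                  # initial features: one-hot node identities
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 2))    # random (untrained) weights

H_next = np.maximum(0, A_norm @ H @ W)       # ReLU activation
print(H_next.shape)  # prints (3, 2)
```

Stacking such layers is what lets each node's representation absorb information from increasingly distant neighborhoods.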
Generative Adversarial Networks (GAN) have also been adapted to node embedding. Wang et al. [41] propose
GraphGAN, which works through two models. First, a generator G(v|vc) tries to approximate the true connectivity distribution of node vc and selects the nodes most likely to be connected to vc. Second, a discriminator D(v, vc) computes the probability that an edge exists between v and vc. The generator tries to fit the distribution of nodes as closely as possible, in order to generate the most indistinguishable fake pairs of connected nodes. The discriminator tries to distinguish between the ground truth and the fake pairs created by the generator. This method is however only able to capture the local structure. Wang et al. apply GraphGAN to node classification, link prediction and movie recommendation tasks.
Random Walks
Random walks were first adopted by graph embedding approaches trying to mimic word embedding methods such as word2vec [24]. They allow representing the graph structure in a sequential form, analogous to sentences in a text. They are used to sample the graph, and can be seen as a proxy allowing to obtain a partial representation of its structure. They also have the advantage of being able to deal with graphs too large to be explored in their entirety. Given a starting node, random-walk-based methods generate node sequences by selecting a neighbor and repeating this procedure until the node sequence reaches a certain length. Perozzi et al. propose
DeepWalk [33], which is among the first node embedding methods based on random walks. First, DeepWalk samples node sequences using uniform random walks, and then applies the standard
SkipGram model [24] to generate the representations. This model takes a node as input and aims at predicting its context, i.e. the nodes in its neighborhood. With this method, nodes with similar contexts share similar representations. Typical applications of this approach include node classification [18,40] and link prediction [30]. However, a limitation is that two nodes can be structurally similar (i.e. they play the same role in the graph) but distant in the graph, and hence share no common neighbors. Their representations might thus be completely different. The
Node2vec [17] method proposed by Grover & Leskovec was developed following the idea of
DeepWalk. The main difference is that
Node2vec uses biased random walks to provide a more flexible notion of a node's neighborhood and to better integrate the notion of structural equivalence. It has been used to predict links in a biomedical context [23], and to classify nodes [18].
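The uniform walk sampling underlying DeepWalk, which Node2vec generalizes with biased transition probabilities controlled by its return and in-out parameters, can be sketched as follows (an illustrative toy, not either method's actual implementation):

```python
import random

# Uniform random-walk sampling: from each node, repeatedly jump to a
# uniformly chosen neighbor. The resulting node sequences play the role
# of sentences for a SkipGram model.
def random_walk(adj, start, length, rng):
    walk = [start]
    while len(walk) < length:
        neighbors = adj[walk[-1]]
        if not neighbors:        # dead end: stop the walk early
            break
        walk.append(rng.choice(neighbors))
    return walk

# Toy undirected graph stored as neighbor lists.
adj = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}
rng = random.Random(42)
walks = [random_walk(adj, node, 5, rng) for node in adj for _ in range(10)]
```

Node2vec's contribution is precisely to replace the uniform `rng.choice` with a distribution that can be tuned toward breadth-first (structural equivalence) or depth-first (homophily) exploration.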
Node2vec randomly initializes the node embeddings, which can result in getting stuck in a local optimum during the computation of the embeddings. Chen et al. propose an improved weight initialization strategy to avoid such problems in their method, called
Hierarchical Representation Learning (HARP) [8]. In
Walklets [34], Perozzi et al. introduce a new random walk strategy. Traditional random-walk methods select the next node among the current node's neighbors. Instead,
Walklets skips over nodes to obtain sequences of nodes which are not direct neighbors. This strategy allows modeling and preserving higher-order relationships between nodes, and can be used in multi-label classification problems [34].

2.2 Whole-Graph Embedding

As mentioned before, node embedding methods are the most widespread in the literature. But some tasks require information at a higher granularity, in which case one would turn to whole-graph embedding. These methods allow representing a whole graph as a single vector of fixed length. They take a collection of graphs as input, and output a representation for each of them.

de Lara & Pineau [20] propose a Simple and Fast (SF) algorithm based on the spectral factorization of the graph Laplacian. It computes the k smallest positive eigenvalues of the normalized Laplacian of the input graph, in ascending order, to form the representation of the whole graph. de Lara & Pineau use their approach to predict the properties of chemical compounds [20]. Mousavi et al. [26] introduce a hierarchical whole-graph embedding framework called
Pyramidal Graph Embedding (PyrGE), based on ideas originating from image processing algorithms. Important global information from images can be extracted by recursively analyzing local information. In the context of graphs, this means that the overall graph structure can be modeled by analyzing substructures at different scales. To this end, a graph pyramid is formed with subgraphs of different scales. Every subgraph is embedded into a vector representation, and all these representations are concatenated to form the global graph embedding. The representations are obtained by factorizing an affinity matrix. PyrGE is especially designed for large graphs, since they potentially contain more distinct scales. Mousavi et al. use it for graph classification tasks.

Table 1  List of graph embedding approaches and the additional information they can encode. PyrGE and Graph2vec can additionally handle node attributes and node/edge labels, respectively. Column Sc. stands for Scale (Node vs. Whole Graph), and Cat. for Category (Matrix Factorization, Neural Networks and Random Walks). Columns W. and D. indicate whether the method supports weighted and directed links, respectively.

Sc.  Cat.  Method     Ref.  W.  D.  Application
N    MF    LLE        [35]  –   –   Image processing
N    MF    LE         [5]   ✓   –   Node classification, link prediction
N    MF    GF         [1]   ✓   –   Graph partitioning
N    MF    HOPE       [30]  ✓   ✓   Node recommendation, proximity approximation
N    MF    BoostNE    [22]  ✓   –   Node classification
N    NN    SDNE       [40]  ✓   –   Node classification, link prediction
N    NN    GCN        [19]  ✓   –   Document/entity classification
N    NN    GraphGAN   [41]  –   –   Movie recommendation
N    RW    DeepWalk   [33]  –   –   Node classification, link prediction
N    RW    Node2vec   [17]  ✓   ✓   Node classification, link prediction
N    RW    HARP       [8]   –   –   Node classification
N    RW    Walklets   [34]  –   –   Node classification
WG   MF    SF         [20]  –   –   Molecular property prediction
WG   MF    PyrGE      [26]  –   –   Graph classification
WG   MF    FGSD       [39]  ✓   –   Graph classification
WG   MF    NetLSD     [38]  –   –   Graph classification, community detection
WG   NN    Graph2vec  [27]  –   –   Graph visualization, similarity ranking

Verma & Zhang propose a
Family of Graph Spectral Distances (FGSD) [39] to represent a whole graph. This method is built on the assumption that the graph's atomic structure is encoded in the multiset of all node pairwise distances. It computes the Moore-Penrose pseudoinverse spectrum of the graph Laplacian. The vector representation of the whole graph is constructed from the histogram of this spectrum. Typical tasks include graph classification in various fields such as bioinformatics and social networks [39,38]. Tsitsulin et al. introduce
NetLSD [38], a permutation- and size-invariant, scale-adaptive embedding method. Like the aforementioned node embedding method LE [5],
NetLSD operates on the Laplacian matrix of the graph. It relies on a physical analogy, consisting in simulating a heat diffusion process on the graph to preserve its structure. The method measures the amount of heat transferred between nodes at different time scales. These heat traces are then used to compute the heat trace signature of the graph, i.e. the vector representation of
the graph. Tsitsulin et al. use
NetLSD for graph classification and for community detection.

Narayanan et al. design Graph2vec [27], which can be viewed as an adaptation of DeepWalk [33] and Node2vec [17] to the whole-graph embedding paradigm. Indeed, these two approaches generate random walks to approximate the context in which nodes appear, and feed them to a
SkipGram model. Graph2vec also uses a
SkipGram model, but it operates on rooted subgraphs, since the method is aimed at representing whole graphs and not nodes. Hence, similarly to nodes with similar neighborhoods sharing close representations in DeepWalk and Node2vec, graphs containing the same rooted subgraphs share similar representations in Graph2vec. A
SkipGram model is then trained on these subgraphs and generates the whole-graph representations.
Graph2vec has been used to perform graph classification, graph visualization and similarity ranking [3]. In addition to the graph structure, it is able to capture information about node labels.
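The rooted subgraphs used by Graph2vec are enumerated with a Weisfeiler-Lehman-style relabeling procedure (this is how the original Graph2vec paper proceeds; it is not detailed above). One relabeling step can be sketched as follows, with illustrative names throughout:

```python
# One Weisfeiler-Lehman relabeling step: each node's new label
# compresses its own label plus the sorted labels of its neighbors, so
# that equal labels after i iterations correspond to identical rooted
# subgraphs of depth i.
def wl_step(adj, labels):
    signatures = {}
    for node, neighbors in adj.items():
        signatures[node] = (labels[node],
                            tuple(sorted(labels[v] for v in neighbors)))
    # Compress signatures into compact integer labels.
    mapping = {}
    return {n: mapping.setdefault(sig, len(mapping))
            for n, sig in signatures.items()}

# Toy graph: nodes 0 and 1 have identical depth-1 neighborhoods.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
labels = {n: 0 for n in adj}   # start from uniform labels
labels = wl_step(adj, labels)
# Nodes 0 and 1 receive the same label; nodes 2 and 3 get distinct ones.
```

Graph2vec treats the multiset of such labels, accumulated over several iterations, as the "vocabulary" of a graph, in direct analogy with the words of a document.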
In this work, we focus on a task consisting in detecting abusive messages in chat logs. This can be formulated as a classification problem: deciding whether a message is abusive or not. In order to turn a chat log into a graph, we rely on a conversational graph extraction method that we previously introduced in [31], and that we briefly present in Section 3.1. The principle here is that classifying the messages amounts to classifying the graphs that represent them. This setup allows us to experiment with various node and whole-graph embedding methods, which we present in Section 3.3. For comparison, we use as a baseline a set of features that we manually crafted in a previous work [31]. These are constituted of a large set of topological measures, selected to get the most exhaustive representation of the graph that we could, as explained in Section 3.2. We view the embedding methods as a way to automate the elaboration of this representation of conversational graphs, which was otherwise designed manually in [31] through feature selection.

Figure 1 gives an overview of our experimental framework, highlighting the differences between the approaches based on the topological measures (top) and the embedding methods (bottom). The baseline features are computed separately for the input graph, before being concatenated to form the global representation of the graph, and this single vector is finally fed to the classifier. By comparison, the graph embedding method directly produces a single vector representation of fixed length (6 in this example), which is then sent to the classifier in a straightforward way.

3.1 Graph Extraction
Intuitively, the content exchanged in an online conversation could be assumed to be the most relevant information to detect important events, such as the occurrence of abuses. However, we showed in a previous work [31] that the dynamics of the conversation, i.e. the way the interactions between its participants unfold, is also critical, and can even lead to better classification results. This information can
Fig. 1  Overview of our experimental framework. The top part corresponds to the approach adopted in our baseline, whereas the bottom part describes the method used with graph embeddings. Figure available at 10.6084/m9.figshare.7442273 under the CC-BY license.

be leveraged by modeling the exchanges between participants through a so-called conversational graph. In this work, we use the same method to extract graphs from conversations. We explain the most essential points of this process in the rest of this section, but the interested reader will find a more detailed description in [31].

Our method is designed to process a stream of messages posted in a given chatroom. It extracts a graph describing the conversational context of a message of interest, called the targeted message. In the context of a classification task, this message corresponds to the message that one wants to classify. We define a so-called context period centered on this targeted message, containing the k messages published right before and right after it (where k is a predefined constant). The graph corresponds to the temporal integration of the events occurring during this period. Each of its nodes represents a participant of the conversation who was active at least once during the considered context period. The graph links are directed and weighted. They model the interactions between these participants over the period: their directions reflect the communication flow, and their weights represent the overall intensity of the exchanges. The iterative process used to estimate the link directions and weights is detailed in [31], as well as the parameters that can be used to control this process.

Figure 2 illustrates our graph extraction method. The left part is the conversation (stream of messages), with the targeted message represented in red. The graph representing the state of the conversation around this message is shown on the right.
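A naive stand-in for this extraction step can be sketched as follows. The actual process in [31] estimates link directions and weights iteratively; here, as an explicitly simplified assumption, each message's author is simply linked to the authors of the few preceding messages in the window, and the function name and parameters are illustrative.

```python
from collections import defaultdict

# Simplified sketch of context-period graph extraction. `messages` is
# the chat log as a list of author names, `target` the index of the
# targeted message, `k` the context half-size, and `history` how many
# preceding messages each message is assumed to respond to.
def extract_graph(messages, target, k, history=2):
    window = messages[max(0, target - k): target + k + 1]
    weights = defaultdict(float)
    for i, author in enumerate(window):
        for prev in window[max(0, i - history): i]:
            if prev != author:               # no self-loops
                weights[(author, prev)] += 1.0
    return dict(weights)   # directed weighted edges: (source, target)

log = ["u1", "u2", "u1", "u3", "u2", "u1", "u4"]
graph = extract_graph(log, target=3, k=2)    # context: messages 1..5
```

Note that "u4", whose only message falls outside the context period, does not appear in the extracted graph, mirroring the requirement that a node's participant be active during the context period.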
Its red node models the author of the targeted message.

3.2 Baseline

Our baseline relies on our previous work presented in [31,11]. In [31], we only focus on graph-based features, while in [11] we also leverage textual content to perform the classification. We propose 3 strategies to combine both types of features: 1) Early fusion relies on a global feature set, containing both text- and graph-based features; 2)
Late fusion uses two separate classifiers dedicated to text- and graph-based features, respectively, and feeds their outputs to a third classifier; and 3)
Hybrid fusion combines both previous strategies. However, our goal in the current work is to study the behavior of graph embedding methods on this task. Therefore, we focus only on the interactions between
User4: PTDR [LMAO]
User1: salut ! [hi!]
User2: alors, ce raid ? [so, what about that raid?]
User1: je l'ai raté ! [I missed it!]
User1: je dormais... [I was sleeping...]
User2: naaaan ! [nooo!]
User3: quoi ?! [what?!]

Fig. 2  Representation of our method to build graphs from conversations. The left part is an extract of the considered conversation, which takes the form of a sequence of chat messages (English glosses in brackets). The red message corresponds to the targeted message, i.e. the message we ultimately want to classify. The right part is the corresponding conversational graph, with the author of the targeted message in red. For readability reasons, weights and directions are omitted in the graph. Figure available at 10.6084/m9.figshare.7442273 under the CC-BY license.

the participants of the conversation, as modeled by the graphs whose extraction process was just described in Section 3.1. It is worth stressing that completely ignoring the textual content exchanged by the participants of the conversation makes our method language-independent and obfuscation-resistant.

We select a set of standard topological measures to describe the graph in a number of distinct ways, in terms of scale and scope. The scale depends on the nature of the characterized object: node or graph. Some of the measures characterize the graph as a whole (e.g. diameter, density), whereas others focus on individual nodes (e.g. degree, closeness). The scope corresponds to the nature of the information used to characterize the object: micro-, meso-, or macroscopic. Some of the selected measures leverage only local information (e.g. transitivity, reciprocity), others consider the full graph (e.g. betweenness, eccentricity) or intermediate substructures (e.g. modularity, participation coefficient).

The graph-scale measures, denoted M_1, ..., M_q in Figure 1, are directly used as classification features. The node measures, denoted m_1, ..., m_p in the same figure, are computed for all nodes, and used to produce two different types of features. The first corresponds to the value obtained for the node modeling the author of the targeted message (i.e. the red node in Figure 2), denoted m_i(v) in Figure 1.
The second is the average of this measure over all nodes in the graph, denoted ⟨m_i⟩ in the figure, which is considered as a graph-scale representation. In total, the full set is constituted of 459 features, including several variants of certain topological measures. Their detailed list is available in [31].

For each annotated message in our corpus, we first extract the corresponding conversational graph, as explained before. We then compute the whole set of topological measures to fully describe each one of these graphs. The graph-scale measures characterize the whole conversation at once, whereas the node-scale measures describe the position of the node corresponding to the author of the targeted message. Finally, all of these values are used as input features fed to an SVM classifier. In addition, we perform a feature ablation study to identify the most discriminative topological measures for the task at hand, which we call Top Features. It turns out that 9 top features are enough to reach a performance 97% as good as the performance obtained with the whole feature set on the test set.

3.3 Embedding Methods

In this subsection, we describe the graph embedding methods that we use in our experiments. We found in our previous work [31] that topological measures describing the graph at different scales and scopes can convey complementary information, allowing to improve the performance on the classification task. This is the reason why we decided to include both whole-graph and node embedding methods in this study. We selected methods that use different strategies and focus on preserving various aspects of the graph, in order to include as much diversity as possible. All implementations are from the
Karate Club toolkit [36], except Node2vec, which was developed by E. Cohen. In our description, the names of the parameters correspond to those used in these toolboxes.

As explained in Section 3.1, we extract a conversational graph for each targeted message, based on its context period. Using a description of the whole graph amounts to considering the entire conversation at once when performing the classification. We found in our previous study [31] that certain graph-scale topological measures such as the
Authority score and
Reciprocity are particularly discriminative for the task at hand. In this experiment, we consider whole-graph embedding methods as the embedding analog of graph-scale measures.
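As an illustration of such graph-scale measures, the density and reciprocity of a directed graph can be computed directly from its edge list. The sketch below uses a toy graph and helper names of our own, not code or data from the actual system.

```python
def density(edges, n):
    """Directed density: fraction of the n*(n-1) possible arcs that exist."""
    return len(set(edges)) / (n * (n - 1))

def reciprocity(edges):
    """Fraction of arcs (u, v) whose reciprocal arc (v, u) also exists."""
    arcs = set(edges)
    return sum(1 for (u, v) in arcs if (v, u) in arcs) / len(arcs)

# Toy directed conversational graph over 4 participants.
edges = [(0, 1), (1, 0), (1, 2), (2, 3), (3, 2), (0, 3)]
print(density(edges, 4))   # 6 arcs out of 12 possible -> 0.5
print(reciprocity(edges))  # 4 reciprocated arcs out of 6 -> 0.666...
```

A high reciprocity, for instance, indicates that participants tend to answer each other, which is precisely the kind of structural signal the whole-graph embeddings are expected to capture implicitly.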
Spectral Features [20] (SF)
This method was developed to perform a classification task over a corpus of unweighted undirected graphs. Moreover, it assumes each graph is connected. Its first step is quite standard and consists in computing the spectrum of the normalized graph Laplacian, keeping only the k smallest positive eigenvalues. The very smallest of these values is ignored though, as it corresponds to the number of components, i.e. 1 according to the above assumption. These eigenvalues, in ascending order, form the representation of the graph. If the graph contains fewer than k nodes (resulting in fewer than k eigenvalues), the vector is right-padded with zeros. Parameter k, called dimensions in the implementation, therefore directly controls the size of the representation.

Family of Graph Spectral Distances [39] (FGSD)
This method was also proposed to perform a classification task over a corpus of undirected graphs, but these are now weighted. Verma & Zhang designed their representation to characterize a graph in terms of certain of its constituting subgraphs, and so that it has the property of being invariant under graph isomorphism. It relies on the assumption that the characteristics of the graph are encoded in the set of all its pairwise node distances. They propose a family of graph spectral distances (FGSD) based on the spectrum of the graph Laplacian, which is able to encode both local and global structural properties. They select the most appropriate distance in this family, in order to fulfill their objective of isomorphism invariance and to obtain a sparse representation.

This results in a representation whose length depends on the graph order. To get a fixed-length vector suitable for classification, they discretize the distribution of the obtained pairwise node distances through a histogram. Parameter-wise, the user controls the way this histogram is computed: it is necessary to provide the range covered by the histogram (hist_range) and its number of bins (hist_bins).

(Node2vec implementation by E. Cohen: https://github.com/eliorc/node2vec)

Graph2vec [27] (G2V)
This method was not designed for a specific task, but was evaluated on graph classification and clustering benchmarks. Unlike the previous methods, Graph2vec must be trained on a corpus of graphs, as it relies on unsupervised learning through a neural network. It is designed by analogy with the document embedding methods proposed in NLP: where these methods leverage the fact that a document is formed of a sequence of words, Narayanan et al. consider a graph as the set of subgraphs surrounding each node.

The algorithm takes the set of graphs to represent, and outputs their representations by applying a two-step process. It first identifies the subgraphs surrounding each node and constituting the graph. More precisely, it looks for so-called rooted subgraphs, i.e. node neighborhoods of a certain order. Second, these subgraphs are considered as the vocabulary and fed to a doc2vec SkipGram [21] model. To reduce the computational cost, the method follows a negative sampling strategy (i.e. at each iteration, the model updates the representations of only a fixed number of negative samples).

This embedding method captures structural equivalence, i.e. graphs whose nodes tend to possess this form of similarity will be close in the representation space. In addition, it is able to take into account an extra input corresponding to a label associated to each node. Parameter-wise, the user must specify the degree of the rooted subgraphs (wl_iterations), while the rest of the parameters control the SkipGram model: size of the representation (dimensions), downsampling frequency (down_sampling), number of epochs (epochs), learning rate (learning_rate) and minimal count of graph feature occurrences (min_count).

In the conversational graph extracted from the context of the targeted message, all nodes are not equal.
As mentioned in Section 3.1, one of them represents the author of the targeted message, which we assume plays a particular role if an abuse is occurring at this point of the conversation. In [31], we experimented with a node-based representation of the conversation, consisting in characterizing this node of interest individually (as opposed to the whole graph), through a selection of nodal topological measures such as
Strength and
Closeness centrality. The node embedding methods presented in this section can be considered as the embedding analogs of these measures in the present study.
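For illustration, two of these nodal measures can be sketched in a few lines of plain Python. The toy graph, adjacency structures and function names below are our own illustrative choices, not part of the implementation used in [31].

```python
from collections import deque

def closeness(adj, v):
    """Closeness centrality of v: (n-1) divided by the sum of BFS distances
    from v to every other node (assumes a connected graph)."""
    dist = {v: 0}
    queue = deque([v])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return (len(adj) - 1) / sum(dist[u] for u in dist if u != v)

def strength(wadj, v):
    """Strength of v: sum of the weights of its incident edges."""
    return sum(wadj[v].values())

# Toy star-shaped graph: the author (node 0) exchanged with everyone.
adj = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
wadj = {0: {1: 2.0, 2: 1.0, 3: 0.5}, 1: {0: 2.0}, 2: {0: 1.0}, 3: {0: 0.5}}
print(closeness(adj, 0))  # distances 1,1,1 -> 3/3 = 1.0
print(strength(wadj, 0))  # 2.0 + 1.0 + 0.5 = 3.5
```

In the actual feature set, such values are computed for the red node of Figure 2 (the author of the targeted message) and used as node-scale features.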
DeepWalk [33] (DW)
DeepWalk relies on another analogy between graph and text, allowing the adaptation of a neural network-based approach originating from NLP.
It takes a graph as input and uses a set of random walks of fixed length as a proxy to represent the graph structure. The procedure samples a certain number of uniform random walks starting from each node, which are considered as the analog of a set of sentences, whereas the node set corresponds to the vocabulary. DeepWalk uses a SkipGram model to update the node representations by predicting their neighborhood (i.e. context). The obtained representation captures the modular structure of the graph.

The parameters of this method include the size of the neighborhood that we want to consider (window_size), the learning rate (learning_rate), the number of epochs (epochs) and the minimal count of node occurrences for including the node in the model (min_count). Other parameters correspond to the size of the generated embeddings (dimensions), the number of random walks starting at each node (walk_number) and their maximum length (walk_length). The last two parameters are typical of random walk-based approaches.
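The walk-sampling step described above can be sketched as follows. The parameter names mirror those of the implementation (walk_number, walk_length), but the code itself is an illustrative simplification, not the Karate Club implementation.

```python
import random

def sample_walks(adj, walk_number, walk_length, seed=0):
    """Uniform truncated random walks: walk_number walks per starting node,
    each of at most walk_length nodes. These walks play the role of
    sentences in the subsequent SkipGram training."""
    rng = random.Random(seed)
    walks = []
    for _ in range(walk_number):
        for start in adj:
            walk = [start]
            while len(walk) < walk_length:
                neighbors = adj[walk[-1]]
                if not neighbors:
                    break  # dead end: truncate the walk
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks

# Toy undirected graph given as adjacency lists.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
walks = sample_walks(adj, walk_number=5, walk_length=8)
# 20 walks in total (5 per node); consecutive nodes are always adjacent.
```

Each walk is then fed to the SkipGram model, which treats co-occurring nodes within window_size as context pairs.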
Node2vec [17] (N2V)
Node2vec is designed to preserve the node neighborhood in the representation space. It follows the main idea of
DeepWalk but uses biased random walks, obtained by introducing weights in the transition probabilities between nodes. The goal of this change is to improve the sampling step and obtain random walks that better model node neighborhoods. The bias allows controlling the behavior of the random walker, resulting in a trade-off between purely breadth-first (exploring the closest nodes first) and depth-first (favoring increasingly distant nodes) samplings. The former tends to produce representations that preserve structural equivalence, whereas the latter provides a wider view of the neighborhood.
Node2vec has the same parameters as
DeepWalk, plus two extra ones. The return parameter p controls the likelihood of immediately revisiting a node during the walk, and the in-out parameter q controls the balance between the breadth-first and depth-first strategies.

Walklets [34] (WL)
Walklets is an extension of DeepWalk which aims at explicitly modeling multi-scale relationships, i.e. combining distinct views of node relationships at different granularity levels. Walklets introduces a key change in the random walk sampling algorithm: the walk can now skip some nodes to reach farther parts of the network. This allows reaching distant nodes while keeping walk lengths short and tractable. Implicitly, this amounts to sampling different powers of the adjacency matrix. Like in DeepWalk, the random walks are the inputs of a SkipGram model. It creates a representation for each power of the adjacency matrix that is explored (i.e. each size of skip), and the output representation is the result of their concatenation.

The method has the same set of parameters as
DeepWalk. However, in Walklets, the window size denotes the power order of the adjacency matrix to use (i.e. the size of the skips in random walks) and thus the number of distinct representations that the model learns. The dimension corresponds to the size of each representation.
The size of the global embedding generated is the product of the values of these two parameters.
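The skipping mechanism can be illustrated by the way training pairs are derived from a sampled walk. This is a minimal sketch of the idea, with a helper name of our own, not the actual Walklets code.

```python
def skip_pairs(walk, k):
    """Training pairs at skip size k: nodes that are k steps apart in the
    walk. Sampling such pairs implicitly models the k-th power of the
    adjacency matrix; one SkipGram model is trained per value of k, and
    the resulting representations are concatenated."""
    return [(walk[i], walk[i + k]) for i in range(len(walk) - k)]

walk = [0, 1, 2, 3, 4]
print(skip_pairs(walk, 1))  # [(0, 1), (1, 2), (2, 3), (3, 4)]
print(skip_pairs(walk, 3))  # [(0, 3), (1, 4)]
```

With k = 1 this degenerates to DeepWalk-style adjacent pairs, whereas larger k values expose increasingly coarse relationships.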
BoostNE [22] (BNE)

BoostNE also learns multiple graph representations at different granularity levels but, unlike Walklets, it relies on matrix factorization. It applies the principle of gradient boosting to perform successive factorizations of an original target matrix, called the node connectivity matrix. Each factorization results in a representation corresponding to an increasingly finer granularity. The final embedding is obtained by concatenating these representations.

The parameters of this method are the following. First, the user must specify the number of granularity levels considered (iterations), as well as two parameters controlling the non-negative matrix factorization step (order and alpha). Finally, similarly to
Walklets, parameter dimensions corresponds to the size of the representation.
GraphWave [15] (GW)
This representation was designed to preserve the structural roles of nodes while being robust to small perturbations in the graph structure. It leverages heat wavelet diffusion patterns to estimate a multidimensional representation. The process mimics a physical phenomenon consisting in propagating some energy through the graph structure, starting from the node of interest. The way this energy diffuses over the graph is assumed to characterize the node and its neighborhood. Formally, it is represented by the distribution of wavelet coefficients, which is sampled to get the proper vector representation of the nodes.

Parameter-wise, a scaling parameter allows controlling the heat kernel (heat_coefficient), which corresponds, in terms of graph structure, to the radius of the considered node neighborhood. The number of points used to sample the distribution of wavelet coefficients (sample_number) corresponds to the size of the representation. The granularity of the grid used to perform this sampling is controlled by parameter step_size. Wavelet calculation can be performed exactly or approximately (mechanism), in which case the parameters switch and approximation allow controlling its precision.
In this section, we present the data and the experimental protocol necessary to carry out our experiments (Section 4.1), and explain how we set the many parameters of the considered graph embedding methods (Section 4.2). We then compare the performance of various types of graph embedding methods (Section 4.3). We finally explore the complementarity of such methods with our baseline based on topological measures (Section 4.4).

4.1 Data and Experimental Protocol
Data
The raw data is a proprietary database containing approximately 4 million messages, written in French and posted on the in-game chat of SpaceOrigin (https://play.spaceorigin.fr/), a Massively Multiplayer Online Role-Playing Game (MMORPG). Among all the exchanged messages, 655 have been reported as abusive by other players, and confirmed as such by at least one human moderator. They constitute the Abuse class. Non-abusive messages constitute most of the dataset (more than 99% of the messages): in this case, it is standard to use only a part of the available data in order to get balanced training and testing classes, and thus prevent the classifier from being biased toward the majority class. We constitute the Non-abuse class by randomly sampling the same number of messages from those that have not been reported, with the constraint that a message must not appear in the same conversation as an already selected message [31]. As a result, our dataset is composed of 1,320 independent messages, equally distributed between the
Abuse and
Non-abuse classes. Using a balanced dataset leads to higher performance and is common in the abuse detection literature [28,37,4,46]. Additionally, we construct a small development set of 120 messages following the same procedure, meant to be used when estimating the parameters of the embedding methods. This set is also balanced. We associate each message with its surrounding context (i.e. the messages posted before and after it), as explained in Section 3.1.
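The class-balancing step can be sketched as a simple random undersampling of the majority class. The function name and the message IDs below are dummy stand-ins, and the sketch omits the additional constraint that a sampled message must not share a conversation with an already selected one.

```python
import random

def balance(abusive_ids, normal_ids, seed=0):
    """Random undersampling: keep every abusive message and draw an
    equal-sized random sample from the non-abusive majority class."""
    rng = random.Random(seed)
    sampled = rng.sample(normal_ids, len(abusive_ids))
    return abusive_ids + sampled

abusive = list(range(100))        # stand-in IDs, not the real corpus
normal = list(range(100, 10000))  # majority class
dataset = balance(abusive, normal)
# len(dataset) == 200: half abusive, half sampled non-abusive
```

The same procedure, applied to the full corpus, yields the balanced 1,320-message dataset described above.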
Experimental Protocol
We conduct our experiments on a binary classification task consisting in detecting whether a message belongs to the
Abuse or the
Non-abuse class. We apply the graph extraction process described in Section 3.1 to our dataset in order to construct a conversational graph for each one of its messages. This process is controlled by several parameters, for which we use the best values identified in our previous work [31]: a sliding window of 10 messages, a context period of 850 messages, and links computed through a linear assignment strategy. On average, the extracted graphs contain 46 nodes and 500 edges. We make this dataset of 1,
320 conversational graphs publicly available online. We use the 8 embedding methods presented in Section 3.3, in addition to the baseline approach (topological features) described in Section 3.2, to generate vector representations of these graphs. For the whole-graph embedding methods, we use the complete representation, whereas for node embedding methods, we only use the representation of the node representing the author of the targeted message (cf. Section 3.1).

We input these representations to an SVM to perform the classification. We use the implementation provided by the Sklearn toolkit [32], under the name SVC (C-Support Vector Classification). As an alternative, we also experimented with Sklearn's implementation of the multilayer perceptron (MLP). However, it yields performances very similar to those of the SVM, so we decided not to present them here. We suppose that the size of our dataset lowers the effectiveness of such neural approaches, at least on our task. We conduct our experiments on an Intel Core i3-3250 3.5 GHz CPU. Because of the relatively small size of our dataset, we set up our experiments using 10-fold cross-validation. We use 70% of the data for training and the remaining 30% for testing. Each fold is balanced in terms of classes.

The classification performance is expressed in terms of micro F-measure, the harmonic mean of precision and recall. Since our dataset is balanced, micro and macro F-measures are equivalent in our experiments. In the rest of the paper, we use F-measure to refer to them collectively.
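For reference, the F-measure used throughout the paper can be computed as follows. This is the standard textbook formulation for the binary case, not code from our experimental pipeline, and the class labels are illustrative.

```python
def f_measure(gold, pred, positive="Abuse"):
    """F-measure for the positive class: harmonic mean of precision
    and recall, computed from true/false positives and false negatives."""
    tp = sum(1 for g, p in zip(gold, pred) if g == p == positive)
    fp = sum(1 for g, p in zip(gold, pred) if g != positive and p == positive)
    fn = sum(1 for g, p in zip(gold, pred) if g == positive and p != positive)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

gold = ["Abuse", "Abuse", "Non-abuse", "Non-abuse"]
pred = ["Abuse", "Non-abuse", "Abuse", "Non-abuse"]
print(f_measure(gold, pred))  # precision 0.5, recall 0.5 -> 0.5
```

On a balanced binary dataset such as ours, averaging this score over both classes (macro) or pooling the counts (micro) gives the same value.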
Fig. 3
Performance in terms of F-measure, as a function of the representation dimension. Figure available at 10.6084/m9.figshare.7442273 under CC-BY license.

on the development set to determine their optimal values. As this is not the main objective of this work, we only focus on the main results here.

During our experimentation, we found that most of the parameters have only a limited impact on the performance of the embedding methods on our abuse detection task. This includes the dimension of the generated representation, which is the only parameter common to all methods. Figure 3 shows the performance as a function of this dimension, for all the methods. Performances are computed on the development set. Note that in this figure, we consider the dimension of the output representation (i.e. the extracted embedding representation of the graphs) and not the dimension of the individual embeddings which are concatenated in methods such as Walklets and BoostNE. As expected, for most methods, too small a dimension seems to lack discriminative power, as there is not enough information to reliably represent the graph structure. Conversely, as our dataset is composed of relatively small graphs (a few hundred nodes at most), too large a dimension does not improve the performance, and just increases the computational cost. Put differently, it seems that we do not need a very large representation of the graph to reach the best performance on this task. This is consistent with the findings of our previous work on the baseline, in which we showed that carefully selecting 9 Top Features among hundreds was enough to keep a performance 97% as good as the original performance on the test set.

The exact parameter settings used for all embedding methods are described in Table 2. Graph2vec is able to take into account a label associated to each node: we use the ID of the author modeled by the node.
For BoostNE, it is worth stressing that parameter order has a strong effect on the performance: increasing its value lowers the performance. Using larger values might be beneficial on larger graphs, though. For GraphWave, the performance stays relatively constant with a sample_number higher than 100, but strongly decreases below that value.

Table 2
Parameters of the 8 graph embedding methods.

Parameter          FGSD   SF     G2V    DW     WL     N2V    BNE    GW
dimensions         200    128    128    128    32     128    8      100
hist_range         10     -      -      -      -      -      -      -
wl_iterations      -      -      1      -      -      -      -      -
down_sampling      -      -      10−    -      -      -      -      -
learning_rate      -      -      0.06   0.05   0.05   -      -      -
epochs             -      -      12     -      -      -      -      -
min_count          -      -      1      1      1      -      -      -
window_size        -      -      -      10     4      10     -      -
walk_number        -      -      -      5      5      10     -      -
walk_length        -      -      -      80     80     20     -      -
p                  -      -      -      -      -      0.95   -      -
q                  -      -      -      -      -      1.0    -      -
iterations         -      -      -      -      -      -      16     -
order              -      -      -      -      -      -      1      -
alpha              -      -      -      -      -      -      0.01   -
step_size          -      -      -      -      -      -      -      0.2
heat_coefficient   -      -      -      -      -      -      -      0.5
approximation      -      -      -      -      -      -      -      100
switch             -      -      -      -      -      -      -      1000

Table 3 F-measures obtained for the baseline and the 8 graph embedding methods. The left part corresponds to results obtained with the embedding methods alone, whereas the right part shows how they perform when combined with the topological measures used in the baseline. In the Scale column, WG stands for Whole-Graph and N for Node.

                        Embeddings only         Embeddings & Topo. meas.
Scale   Method          Dimension  F-measure    Dimension  F-measure
WG      FGSD            200        77.06        659        87.27
        SF              128        79.88        587        88.34
        Graph2vec       128        81.91        587        89.16
N       DeepWalk        128        78.85        587        87.73
        Node2vec        128        83.70        587        89.03
        Walklets        128        79.49        587        88.15
        BoostNE         136        63.28        595        86.54
        GraphWave       200        83.04        659        87.97
–       Baseline        459        88.08        –          –

Embeddings Only
The first two columns of Table 3 present the F-measure values obtained by our baseline and the 8 embedding methods described in Section 3.3. It also shows the dimension of the vector representations used to perform the classification. The last two columns correspond to results obtained when using the embedding methods and the topological measures from our baseline (described in Section 3.2) simultaneously, by concatenating their vector representations. The reported performances are obtained following the protocol described in Section 4.1.

We first focus on the results obtained without the topological measures (embeddings only). They show how appropriate the considered embedding methods are for the task at hand. Our first observation is that there is no clear distinction between node and whole-graph approaches in terms of F-measure. Since the whole graph represents the message and its context, one could have expected that embedding the whole graph would capture more important information than a single node embedding. However, it seems that these graphs can be well characterized by focusing on the embedding of the single node corresponding to the author of the targeted message. This could mean that the relative position of this node in the graph is enough to characterize the whole conversation, and that node-level embeddings are able to capture such information.

The baseline yields the best performance, though. This is not surprising, because this hand-crafted set of features was specifically designed for this task and dataset, whereas the embedding methods are somewhat generic. An interesting result is that Node2vec and GraphWave (node-scale approaches), and Graph2vec (whole-graph scale) yield relatively good performance. These three approaches have several advantages over the baseline.
First, they are not specifically designed for this task or dataset, and are hence more likely to be efficient in other settings. Second, embedding methods are more scalable than hand-crafted sets of features. Computing the topological measures used in our baseline is computationally very expensive, with a total runtime of more than 8 hours. By contrast, it only takes a few minutes to apply Graph2vec, GraphWave and Node2vec on the same computer, which makes them a lot more time-efficient. The other methods, except BoostNE, obtain decent performances, with an F-measure around 79%. SF and FGSD, which operate on the whole graph, might be penalized by the small size of our dataset and by the fact that the graphs have approximately the same size and thus possibly similar structures. DeepWalk is less efficient than Node2vec, which is in line with other studies [16,17]. The Walklets algorithm learns multi-scale relationships in the graph; however, such relationships might not be very developed in our graphs, which could explain its lower performance. This observation could also explain the very poor performance of BoostNE, which also operates on different granularity levels.

Embeddings & Topological Measures
Now, to study the complementarity between the baseline and the embedding methods, we propose to combine these features following the three fusion strategies proposed in [11] and summarized in Section 3.2. An important difference with our previous work is that instead of using these strategies to combine topological measures and text features, our aim here is to combine topological measures and graph embeddings, since we do not use the textual content of the exchanged messages.

Our experiments show that the three strategies lead to very similar performances on our task. Hence, in the remainder of this paper, we only present the results obtained using one of them. We choose the
Early fusion, because it is the simplest of the three, and also because it eases the interpretation of the results, in particular regarding feature ablation. In practice, the embedding generated by the considered method is concatenated with the topological measures computed for the graph. We perform a new classification following exactly the same protocol as before, and report the obtained performance in the last two columns of Table 3.

Graph2vec, SF and Node2vec are the three methods that improve on the baseline performance when combined with it, up to an 89.16 F-measure for Graph2vec.
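Early fusion thus reduces to vector concatenation. The sketch below only checks the resulting dimensions against those reported in Table 3 (e.g. 128 + 459 = 587 for Graph2vec); the feature values are dummies and the function name is our own.

```python
def early_fusion(embedding, topo_features):
    """Early fusion: concatenate the graph embedding with the topological
    features before feeding the resulting vector to the classifier."""
    return list(embedding) + list(topo_features)

# Dummy vectors with the dimensions reported in Table 3.
graph2vec_repr = [0.0] * 128  # placeholder values, not real embeddings
topo_measures = [0.0] * 459   # the 459 baseline topological features
fused = early_fusion(graph2vec_repr, topo_measures)
print(len(fused))  # 128 + 459 = 587, the combined dimension in Table 3
```

The same arithmetic gives 659 for the 200-dimensional FGSD and GraphWave representations.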
It seems thatthe representations generated by these methods introduce some incorrect informa-tion when combined with the baseline, which causes a loss of performance. It isworth highlighting that these approaches were already the three worst performingmethods when used without the baseline.4.4 Feature AblationIn our previous work [11], we identified, among the topological measures used inthe baseline, the most discriminative features for our classification task, which wecalled Top Features . For this purpose, we used a standard feature ablation process.As embedding methods provide vector representations in which dimensions arenot directly interpretable, it does not make sense to apply the same method here.Instead, we propose to study whether the most important topological measuresfrom our baseline are well captured by the embedding methods.To this end, we compare the F -measure score obtained by each embeddingmethod on its own, with the score obtained by using a representation composedof the same embedding completed by one of the Top Features. Figure 4 shows thedifference between these two scores for all embedding methods and Top Features. Ifthe performance significantly increases, we conclude that the topological measurewas not captured by the embedding. This cases are represented in red in thefigure. If the performance stays the same or increases by less than 0.50% with theadditional feature, we conclude that the structural property corresponding to thistopological measure is well captured by the embedding (represented in green). Ifthe performance increase is higher than 0.50% but not statistically significant, weconclude that the property is only partially captured by the embedding method(represented in blue). An interesting result shown by Figure 4 is that some topological measures seemto be well captured by all, or almost all the embedding methods ( e.g.
PageRank centrality, Vertex count, Closeness centrality at node and graph level, and Reciprocity). Contrariwise, the average Coreness score, which is considered as a graph-scale measure, is partially captured by SF, a whole-graph embedding method, and

                    FGSD   SF     G2V    DW     N2V    WL     BNE    GW
Pagerank            0.31   0.0    0.07   0.1    -0.0   -0.13  4.36   0.05
Strength            2.76   1.06   4.16   1.84   1.08   1.77   15.7   0.41
Closeness           0.05   0.05   -0.03  -0.08  0.1    0.13   0.0    0.82
Hub score           1.52   0.79   4.58   1.91   0.87   1.48   5.94   0.38
Average Coreness    5.11   1.71   4.67   2.28   0.91   2.28   17.05  0.69
Vertex Count        0.28   -0.03  0.25   0.03   0.02   -0.05  0.41   0.23
Average Closeness   0.52   0.08   0.13   0.22   0.35   0.33   1.24   1.07
Average Authority   1.37   -0.03  0.05   -0.03  0.02   0.02   2.65   -0.13
Reciprocity         0.03   0.08   0.2    0.12   0.14   0.12   -0.15  -0.07
Fig. 4
Topological measures captured (green), partially captured (blue) or not captured (red) by the embedding approaches. The first 4 topological measures are computed at the
Graph level and the last 5 topological measures are computed at the
Node level. Each value is the difference between the F-measure score obtained by the embedding method on its own and the score obtained by the embedding method completed by the corresponding Top Feature. Figure available at 10.6084/m9.figshare.7442273 under CC-BY license.

by Node2vec and GraphWave, two node embedding methods. The latter is also the only method that completely captures the Strength centrality and Hub score. SF, Node2vec and GraphWave, which are among the best performing methods in Table 3, are able to capture or partially capture all the important measures studied. However, they yield an F-measure much lower than the baseline. Therefore, we can suppose that these methods might not capture other properties of the graph which are less important individually, but improve the performance when all combined. Graph2vec fails to capture some measures (i.e. the average Coreness score, Strength centrality and Hub score). However, as suggested by the result of its combination with the baseline, Graph2vec might miss some important measures of the baseline, but it is able to capture other important properties of the graph which are not conveyed by the baseline features. This is why Graph2vec is the best performing method when considering the combination with the baseline.

Another interesting result of this study is that there is no clear difference in the type of information captured by node and whole-graph embedding approaches. Node embedding methods are able to capture certain graph-scale topological measures, and whole-graph embedding methods can capture some node-scale measures. This property may result from the relatively small size of our graphs, as the second-order neighborhood of a node might include a majority of the nodes in the graph. Thus, the differences between node and whole-graph embedding methods are not as important as they could be on larger graphs. Furthermore, our graphs are centered around a specific node.
This specificity might help the whole-graph embeddings better capture node-level information.
In this paper, we use graph embedding representations to tackle the problem of automatic abuse detection in online textual messages. We compare 8 methods operating on nodes and whole graphs, to identify the category of embeddings which best fits the needs of this task. Our results show that Node2vec, GraphWave and Graph2vec are the methods that perform best on this task. We compare the performance of these 8 graph embedding methods with a baseline previously designed using a feature engineering approach. With an 88.08 F-measure, this baseline outperforms the embedding methods, but the top ones obtain promising results: up to 83.70 with the Node2vec approach. We also study the complementarity between the embedding methods and the topological measures used in the baseline: combining the latter with Graph2vec improves the performance up to an 89.16 F-measure. Finally, we study which aspects of the graph structure each embedding method is able to capture. We find that methods operating on nodes and whole graphs are all able to include the information conveyed by certain topological measures defined both at node and graph scales.

A limitation of this work is the small size of our dataset (1,320 messages). Our application could benefit from a larger dataset with more variety and more examples of abusive messages. We have already started working in this direction, by proposing and freely distributing WAC, a corpus based on Wikipedia edit discussion pages [10]. It combines and improves two preexisting corpora to simultaneously provide comments annotated in terms of abuse and their surrounding conversation. This new corpus could be a larger field of experimentation, with around 383k annotated messages, including 51k abusive ones distributed over 3 classes of abuse. Another limitation is the relatively small size of the graphs that we use to model conversations, which is likely to reduce the differences between node and whole-graph embedding methods.
Working on larger graphs could help better distinguish these two types of embedding methods.

In the current work, we use static graphs to represent conversations. However, as our dataset contains the time at which messages were posted, a possible future work is to integrate a temporal aspect in our study: for example, by constructing sequences of embeddings to represent the evolution of the conversation over time, or by experimenting with representations able to simultaneously embed structural and temporal information. Another track is to leverage the content of the messages through text embeddings, as we did previously with a feature engineering approach. Here too, it is possible to consider using separate embeddings for structure and text, or specific embeddings able to combine both types of information at once. Finally, another interesting track is to study the complementarity of different categories of graph embedding methods, for example by simultaneously using node, edge and whole-graph representations.

(Dataset DOI: 10.6084/m9.figshare.11299118)
Conflict of interest
On behalf of all authors, the corresponding author states that there is no conflict of interest.
References
1. Ahmed, A., Shervashidze, N., Narayanamurthy, S., Josifovski, V., Smola, A.J.: Distributed large-scale natural graph factorization. In: 22nd International Conference on World Wide Web, pp. 37–48 (2013). DOI 10.1145/2488388.2488393
2. Badjatiya, P., Gupta, S., Gupta, M., Varma, V.: Deep learning for hate speech detection in tweets. In: 26th International Conference on World Wide Web Companion, pp. 759–760 (2017). DOI 10.1145/3041021.3054223
3. Bai, Y., Ding, H., Qiao, Y., Marinovic, A., Gu, K., Chen, T., Sun, Y., Wang, W.: Unsupervised inductive graph-level representation learning via graph-graph proximity. In: 28th International Joint Conference on Artificial Intelligence, pp. 1988–1994 (2019). DOI 10.24963/ijcai.2019/275
4. Balci, K., Salah, A.A.: Automatic analysis and identification of verbal aggression and abusive behaviors for online social games. Computers in Human Behavior, 517–526 (2015). DOI 10.1016/j.chb.2014.10.025
5. Belkin, M., Niyogi, P.: Laplacian eigenmaps and spectral techniques for embedding and clustering. In: Advances in Neural Information Processing Systems 14, pp. 585–591 (2002). URL http://papers.nips.cc/paper/1961-laplacian-eigenmaps-and-spectral-techniques-for-embedding-and-clustering.pdf
6. Cai, H., Zheng, V.W., Chang, K.C.C.: A comprehensive survey of graph embedding: Problems, techniques, and applications. IEEE Transactions on Knowledge and Data Engineering (9), 1616–1637 (2018). DOI 10.1109/TKDE.2018.2807452
7. Chatzakou, D., Kourtellis, N., Blackburn, J., De Cristofaro, E., Stringhini, G., Vakali, A.: Mean birds: Detecting aggression and bullying on Twitter. In: 2017 ACM on Web Science Conference, pp. 13–22 (2017). DOI 10.1145/3091478.3091487
8. Chen, H., Perozzi, B., Hu, Y., Skiena, S.: HARP: Hierarchical representation learning for networks. In: 32nd AAAI Conference on Artificial Intelligence (2018)
9. Chen, Y., Zhou, Y., Zhu, S., Xu, H.: Detecting offensive language in social media to protect adolescent online safety. In: International Conference on Privacy, Security, Risk and Trust and International Conference on Social Computing, pp. 71–80 (2012). DOI 10.1109/SocialCom-PASSAT.2012.55
10. Cécillon, N., Labatut, V., Dufour, R., Linarès, G.: WAC: A corpus of Wikipedia conversations for online abuse detection. In: 12th International Conference on Language Resources and Evaluation (2020)
11. Cécillon, N., Labatut, V., Dufour, R., Linarès, G.: Abusive language detection in online conversations by combining content- and graph-based features. Frontiers in Big Data, 8 (2019). DOI 10.3389/fdata.2019.00008
12. Dadvar, M., Trieschnigg, D., Ordelman, R., de Jong, F.: Improving cyberbullying detection with user context. In: 35th European Conference on IR Research, vol. 7814 (2013). DOI 10.1007/978-3-642-36973-5_62
13. Dinakar, K., Reichart, R., Lieberman, H.: Modeling the detection of textual cyberbullying. In: 5th International AAAI Conference on Weblogs and Social Media / Workshop on the Social Mobile Web, pp. 11–17 (2011)
14. Djuric, N., Zhou, J., Morris, R., Grbovic, M., Radosavljevic, V., Bhamidipati, N.: Hate speech detection with comment embeddings. In: 24th International Conference on World Wide Web, pp. 29–30 (2015). DOI 10.1145/2740908.2742760
15. Donnat, C., Zitnik, M., Hallac, D., Leskovec, J.: Learning structural node embeddings via diffusion wavelets. In: 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1320–1329 (2018). DOI 10.1145/3219819.3220025
16. Goyal, P., Ferrara, E.: Graph embedding techniques, applications, and performance: A survey. Knowledge-Based Systems, 78–94 (2018). DOI 10.1016/j.knosys.2018.03.022
17. Grover, A., Leskovec, J.: node2vec: Scalable feature learning for networks. In: 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 855–864 (2016). DOI 10.1145/2939672.2939754
18. Hou, B., Wang, Y., Zeng, M., Jiang, S., Mengshoel, O.J., Tong, Y., Bai, J.: Customized graph embedding: Tailoring embedding vectors to different applications. arXiv (2019). URL http://arxiv.org/abs/1911.09454
19. Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. In: ICLR (2017). URL https://arxiv.org/pdf/1609.02907.pdf
20. de Lara, N., Pineau, E.: A simple baseline algorithm for graph classification. arXiv (2018). URL https://arxiv.org/pdf/1810.09155.pdf
21. Le, Q., Mikolov, T.: Distributed representations of sentences and documents. In: 31st International Conference on Machine Learning, vol. 32, pp. II-1188–II-1196 (2014). URL http://proceedings.mlr.press/v32/le14.html
22. Li, J., Wu, L., Guo, R., Liu, C., Liu, H.: Multi-level network embedding with boosted low-rank matrix approximation. In: 2019 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, pp. 49–56 (2019). DOI 10.1145/3341161.3342864
23. Liang, X., Li, D., Song, M., Madden, A., Ding, Y., Bu, Y.: Predicting biomedical relationships using the knowledge and graph embedding cascade model. PLoS ONE (2019). DOI 10.1371/journal.pone.0218264
24. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. In: ICLR Workshop Track Proceedings (2013). URL https://arxiv.org/pdf/1301.3781.pdf
25. Mishra, P., Del Tredici, M., Yannakoudakis, H., Shutova, E.: Author profiling for abuse detection. In: 27th International Conference on Computational Linguistics, pp. 1088–1098 (2018)
26. Mousavi, S.F., Safayani, M., Mirzaei, A., Bahonar, H.: Hierarchical graph embedding in vector space by graph pyramid. Pattern Recognition (C), 245–254 (2017). DOI 10.1016/j.patcog.2016.07.043
27. Narayanan, A., Chandramohan, M., Venkatesan, R., Chen, L., Liu, Y., Jaiswal, S.: graph2vec: Learning distributed representations of graphs. In: 13th International Workshop on Mining and Learning with Graphs (MLG) (2017)
28. Nobata, C., Tetreault, J., Thomas, A., Mehdad, Y., Chang, Y.: Abusive language detection in online user content. In: 25th International Conference on World Wide Web, pp. 145–153 (2016). DOI 10.1145/2872427.2883062
29. Okky Ibrohim, M., Budi, I.: A dataset and preliminaries study for abusive language detection in Indonesian social media. Procedia Computer Science, 222–229 (2018). DOI 10.1016/j.procs.2018.08.169
30. Ou, M., Cui, P., Pei, J., Zhang, Z., Zhu, W.: Asymmetric transitivity preserving graph embedding. In: 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1105–1114 (2016). DOI 10.1145/2939672.2939751
31. Papegnies, E., Labatut, V., Dufour, R., Linarès, G.: Conversational networks for automatic online moderation. IEEE Transactions on Computational Social Systems (1), 38–55 (2019). DOI 10.1109/TCSS.2018.2887240
32. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 2825–2830 (2011)
33. Perozzi, B., Al-Rfou, R., Skiena, S.: DeepWalk: Online learning of social representations. In: 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 701–710 (2014). DOI 10.1145/2623330.2623732
34. Perozzi, B., Kulkarni, V., Skiena, S.: Don't walk, skip! Online learning of multi-scale network embeddings. In: 2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, pp. 258–265 (2017). DOI 10.1145/3110025.3110086
35. Roweis, S.T., Saul, L.K.: Nonlinear dimensionality reduction by locally linear embedding. Science (5500), 2323–2326 (2000). DOI 10.1126/science.290.5500.2323
36. Rozemberczki, B., Kiss, O., Sarkar, R.: An API oriented open-source Python framework for unsupervised learning on graphs. arXiv (2020). URL https://arxiv.org/pdf/2003.04819.pdf
37. Salminen, J., Almerekhi, H., Milenković, M., Jung, S., An, J., Kwak, H., Jansen, B.J.: Anatomy of online hate: Developing a taxonomy and machine learning models for identifying and classifying hate in online news media. In: International AAAI Conference on Web and Social Media (ICWSM 2018) (2018)
38. Tsitsulin, A., Mottin, D., Karras, P., Bronstein, A., Müller, E.: NetLSD: Hearing the shape of a graph. In: 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2347–2356 (2018). DOI 10.1145/3219819.3219991
39. Verma, S., Zhang, Z.L.: Hunt for the unique, stable, sparse and fast feature learning on graphs. In: Advances in Neural Information Processing Systems 30, pp. 88–98 (2017). URL http://papers.nips.cc/paper/6614-hunt-for-the-unique-stable-sparse-and-fast-feature-learning-on-graphs.pdf
40. Wang, D., Cui, P., Zhu, W.: Structural deep network embedding. In: 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1225–1234 (2016). DOI 10.1145/2939672.2939753
41. Wang, H., Wang, J., Wang, J., Zhao, M., Zhang, W., Zhang, F., Xie, X., Guo, M.: GraphGAN: Graph representation learning with generative adversarial nets. In: 32nd AAAI Conference on Artificial Intelligence, pp. 2508–2515 (2018)
42. Warner, W., Hirschberg, J.: Detecting hate speech on the world wide web. In: Second Workshop on Language in Social Media, pp. 19–26 (2012)
43. Waseem, Z., Hovy, D.: Hateful symbols or hateful people? Predictive features for hate speech detection on Twitter. In: NAACL Student Research Workshop, pp. 88–93 (2016)
44. Xiang, G., Fan, B., Wang, L., Hong, J., Rose, C.: Detecting offensive tweets via topical feature discovery over a large scale Twitter corpus. In: 21st ACM International Conference on Information and Knowledge Management, pp. 1980–1984 (2012). DOI 10.1145/2396761.2398556
45. Yan, S., Xu, D., Zhang, B., Zhang, H., Yang, Q., Lin, S.: Graph embedding and extensions: A general framework for dimensionality reduction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40–51 (2007)
46. Yin, D., Xue, Z., Hong, L., Davison, B.D., Kontostathis, A., Edwards, L.: Detection of harassment on Web 2.0. In: WWW Workshop: Content Analysis in the Web 2.0, pp. 1–7 (2009)