A study of text representations for Hate Speech Detection
Chrysoula Themeli, George Giannakopoulos, Nikiforos Pittaras
Department of Informatics and Telecommunications, National and Kapodistrian University of Athens
{cthemeli, npittaras}@di.uoa.gr
NCSR Demokritos, Athens, Greece
{ggianna, pittarasnikif}@iit.demokritos.gr

⋆ Supported by NCSR Demokritos, and the Department of Informatics and Telecommunications, National and Kapodistrian University of Athens

Abstract.
The pervasiveness of the Internet and social media has enabled the rapid and anonymous spread of Hate Speech content on microblogging platforms such as Twitter. Current EU and US legislation against hateful language, in conjunction with the large amount of data produced on these platforms, has led to automatic tools being a necessary component of the Hate Speech detection task and pipeline. In this study, we examine the performance of several, diverse text representation techniques paired with multiple classification algorithms, on the automatic Hate Speech detection and abusive language discrimination task. We perform an experimental evaluation on binary and multiclass datasets, paired with significance testing. Our results show that simple hate-keyword frequency features (BoW) work best, followed by pre-trained word embeddings (GloVe) as well as N-gram graphs (NGGs): a graph-based representation which proved to produce efficient, very low-dimensional but rich features for this task. A combination of these representations paired with Logistic Regression or 3-layer neural network classifiers achieved the best detection performance, in terms of micro and macro F-measure.
Keywords: hate speech · natural language processing · classification · social media

1 Introduction

Hate Speech is a common affliction in modern society. Nowadays, people can come across Hate Speech content ever more easily through social media platforms, websites and forums containing user-created content. The increased use of social media gives individuals the opportunity to easily spread hateful content and reach a larger audience than ever before. On the other hand, social media platforms like Facebook or Twitter want both to comply with legislation against Hate Speech and to improve user experience. Therefore, they need to track and remove Hate Speech content from their websites efficiently.
Due to the large amount of data transmitted through these platforms, delegating such a task to humans is extremely inefficient. A usual compromise is to rely on user reports, reviewing only the reported posts and comments. This is also ineffective, since it relies on the users' subjectivity and trustworthiness, as well as on their ability to thoroughly track and flag such content. For all of the above reasons, the development of automated tools to detect Hate Speech content is deemed necessary. The goal of this work is: (i) to study different text representations and classification algorithms in the task of Hate Speech detection; (ii) to evaluate whether the n-gram graph representation [10] can constitute a rich/deep feature set (as, e.g., in [20]) for the given task.

The structure of the paper is as follows. In Section 2 we define the Hate Speech detection problem, while in Section 3 we discuss related work. We overview our study approach and elaborate on the proposed method in Section 4. We then experimentally evaluate the performance of different approaches in Section 5, concluding the paper in Section 6 by summarizing the findings and proposing future work.
2 Problem Definition

The first step towards Hate Speech detection is to provide a clear and concise definition of Hate Speech. This is especially important during the manual compilation of Hate Speech detection datasets, where human annotators are involved. In their work, the authors of [13] asked three students of different race and the same age and gender to annotate whether a tweet contained Hate Speech or not, as well as the degree of its offensiveness. The agreement was only 33%, showing that Hate Speech detection can be highly subjective and dependent on the educational and/or cultural background of the annotator. Thus, an unambiguous definition is necessary to eliminate any such personal bias in the annotation process.

Usually, Hate Speech is associated with insults or threats. Following the definition provided by [3], "it covers all forms of expressions that spread, incite, promote or justify racial hatred, xenophobia, antisemitism or other forms of hatred based on intolerance". Moreover, it can be "insulting, degrading, defaming, negatively stereotyping or inciting hatred, discrimination or violence against people in virtue of their race, ethnicity, nationality, religion, sexual orientation, disability, gender identity". However, we cannot disregard that Hate Speech can also be expressed by statements promoting the superiority of one group of people over another, or by expressing stereotypes against a group of people.

The goal of a Hate Speech Detection model is, given an input text T, to output True if T contains Hate Speech and False otherwise. Modeling the task as a binary classification problem, the detector is built by learning from a training set and is subsequently evaluated on unseen data. Specifically, the input is transformed into a machine-readable format via a text representation method, which ideally captures and retains informative characteristics of the input text. The representation data is fed to a machine learning algorithm that assigns the input to one of the two classes, with a certain confidence. During the training phase, this discrimination information is used to construct the classifier. The classifier is then applied to data not encountered during training, in order to measure its generalization ability.

In this study, we focus on user-generated texts from social media platforms, specifically Twitter posts. We evaluate the performance of several established text representations (e.g. Bag of Words, word embeddings) and classification algorithms. We also investigate the contribution of the graph-based n-gram graph features to the Hate Speech classification process. Moreover, we examine whether a combination of deep features (such as n-gram graphs) and shallow features (such as Bag of Words) can provide top performance in the Hate Speech detection task.
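To make the above pipeline concrete, the following is a minimal sketch pairing a simple bag-of-words representation with a Logistic Regression classifier in scikit-learn; the toy tweets and labels are hypothetical placeholders, not taken from the datasets used later in this paper.

# Minimal sketch of the representation -> classification pipeline
# described above; toy data, not our actual implementation.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

train_texts = ["those people are vermin", "what a lovely morning",
               "they should all disappear", "great match yesterday"]
train_labels = [True, False, True, False]   # True = contains Hate Speech
test_texts = ["you are all vermin", "lovely weather today"]
test_labels = [True, False]

# representation: map raw text to machine-readable feature vectors
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

# classification: learn from the training set, evaluate on unseen data
classifier = LogisticRegression()
classifier.fit(X_train, train_labels)
print(f1_score(test_labels, classifier.predict(X_test)))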
3 Related Work

In this section, we provide a short review of the related work, not only on Hate Speech detection, but on similar tasks as well. Examples of such tasks can be found in [18], where the authors aim to identify which users express Hate Speech more often, while [32] detect and delete hateful content within a comment, making sure that what is left has correct syntax. The latter is a demanding task which requires the precise identification of grammatical relations and typed dependencies among the words of a sentence. The results of their proposed method have a 90.94% agreement with the manual filtering results.

Automatic Hate Speech detection is usually modeled as a binary classification task. However, multi-class classification can be applied to identify the specific kind of Hate Speech (e.g. racism, sexism etc.) [1,21]. Another useful task is the detection of the specific words or phrases that are offensive or promote hatred, investigated in [28].
In this work we focus on representations, i.e. the mapping of written human language into a collection of useful features in a form that is understandable by a computer and, by extension, by a Hate Speech Detection model. Below we overview a number of different representations used within this domain.

A very popular representation approach is the Bag of Words (BOW) model [13,2,1], a Vector Space Model extensively used in Natural Language Processing and document classification. In BOW, the text is segmented into words, followed by the construction of a histogram of (possibly weighted) word frequencies. Since BOW discards word order and syntactic, semantic and grammatical information, it is commonly used as a baseline in NLP tasks. An extension of BOW is the Bag of N-grams [19,13,18,5,29], which replaces the unit of interest in BOW from single words to n contiguous tokens. A token is usually a word or a character in the text, giving rise to word n-gram and character n-gram models, respectively. Due to the contiguity consideration, n-gram bags retain local spatial and order information.
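As a short illustration of these units (a toy snippet, not part of any cited system):

# character and word n-grams of a toy text
def char_ngrams(text, n):
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def word_ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(char_ngrams("hate", 2))                       # ['ha', 'at', 'te']
print(word_ngrams("you are all awful".split(), 2))
# [('you', 'are'), ('are', 'all'), ('all', 'awful')]

Because each n-gram spans adjacent tokens, the resulting bag preserves local order information that single-word counts discard.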
The authors in [5] claim that lexicon detection methods alone are inadequate in distinguishing between Hate Speech and Offensive Language, counter-proposing n-gram bags with TF-IDF weighting along with a sentiment lexicon, classified with L2-regularized Logistic Regression [16]. On the other hand, [1] use character n-grams, BOW and TF-IDF features as a baseline, proposing word embeddings from GloVe (https://nlp.stanford.edu/projects/glove/). In [21] the authors use character and word CNNs, as well as a hybrid CNN model, to classify sexist and racist Twitter content. They compare multi-class detection with a coarse-to-fine two-step classification process, achieving similar results with both approaches. A variety of other features has also been used, such as word or paragraph embeddings [7,28,1], LDA and Brown clustering [24,31,28,29], sentiment analysis [12,5], lexicons and dictionaries [12,26,6], and POS tags [19,32,24].

Regarding classification algorithms, SVMs [4], Logistic Regression (LR) and Naive Bayes (NB) are the most widely used (e.g. [28,24,5,7]). In [30] and [31], the authors use a bootstrapping approach to aid the training process via data generation. This approach was used as a semi-supervised learning process to generate additional data automatically or to create hatred lexical resources. The authors of [31] use the Map-Reduce framework in Hadoop to collect tweets automatically from users that are known to use offensive language, and a bootstrapping method to extract topics from tweets. Other algorithms used are Decision Trees and Random Forests (RF) [5,2,31], while [1] and [6] have used Deep Learning approaches via LSTM networks. Specifically, [1] use CNN, LSTM and FastText, i.e. a model represented by average word vectors similar to BOW, which are updated through backpropagation. Their LSTM model achieved the best performance with a 0.93 F-Measure, and was used to train a GBDT (Gradient Boosted Decision Trees) classifier. In [5], the authors use several classification algorithms, such as regularized LR, NB, Decision Trees, RF and linear SVM, with L2-regularized LR outperforming the other approaches in terms of F-score.

For more information, the survey of [25] provides a detailed analysis of detector components used for Hate Speech detection and similar tasks.
4 Approach

In this section we describe the text representation and classification components used in our implementations of a Hate Speech Detection pipeline. We have used a variety of different text representations, i.e. bag of words, embeddings, n-grams and n-gram graphs, and tested these representations with multiple classification algorithms. We have implemented the feature extraction in Java and used both Weka and scikit-learn (sklearn) to implement the classification algorithms. For artificial neural networks (ANNs), we have used the sklearn and Keras frameworks. Our model can be found in our GitHub repository (https://github.com/cthem/hate-speech-detection).

In order to discard noise and useless artifacts, we apply standard preprocessing to each tweet. First, we remove all URLs, mentions (e.g. @username), RT (retweet) markers and hashtags (i.e. words starting with #), as well as stopwords (https://github.com/igorbrigadir/stopwords).

After preprocessing, we apply a variety of representations, starting with the Bag of Words (BOW) model. This representation results in a high-dimensional vector, containing all encountered words, and requires a significant amount of time to process each text. In order to reduce time and space complexity, we limit the number of words of interest to keywords from HateBase [5] (https://github.com/t-davidson/hate-speech-and-offensive-language).

Moreover, we have used additional bag models, with respect to word and character n-grams. In order to guarantee a common bag feature vector dimension across texts, we pre-compute all n-grams that appear in the dataset, resulting in a sparse and high-dimensional vector. Similarly to the BOW features, it is necessary to reduce the vector space in order to reduce time and space complexity. Therefore, we keep only the 100 most frequent n-gram features, discarding the rest. Unfortunately, as we will illustrate in the experiments, this decision resulted in highly sparse vectors and, thus, reduced the effectiveness of those features.

Furthermore, we have used GloVe word embeddings [22] to represent the words of each tweet, mapping each word to a 50-dimensional real vector and arriving at a single tweet vector representation via mean averaging. Words missing from the GloVe mapping were discarded.
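The following is an illustrative Python sketch of this preprocessing and of the vector-based features; the regular expressions, the two-entry keyword list and the toy GloVe lookup are assumptions for illustration, while our actual feature extraction is implemented in Java.

import re
import numpy as np

HATE_KEYWORDS = ["keyword1", "keyword2"]  # stand-in for the HateBase keyword list

def preprocess(tweet):
    # strip URLs, @mentions, RT markers and hashtag words, then lowercase
    tweet = re.sub(r"https?://\S+", " ", tweet)
    tweet = re.sub(r"@\w+", " ", tweet)
    tweet = re.sub(r"\bRT\b", " ", tweet)
    tweet = re.sub(r"#\w+", " ", tweet)
    return tweet.lower().split()

def bow_features(tokens):
    # keyword-restricted BOW: frequency of each keyword in the tweet
    return np.array([tokens.count(k) for k in HATE_KEYWORDS], dtype=float)

def glove_features(tokens, glove, dim=50):
    # mean of the 50-dimensional GloVe vectors of in-vocabulary words
    vectors = [glove[t] for t in tokens if t in glove]
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim)

glove = {"hate": np.random.rand(50), "mondays": np.random.rand(50)}  # toy lookup
tokens = preprocess("RT @user I hate mondays #rant https://t.co/xyz")
features = np.concatenate([bow_features(tokens), glove_features(tokens, glove)])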
Expanding the use of n-grams, we examine whether n-gram graphs (NGGs) [9,11] can make a significant contribution to detecting Hate Speech. NGGs are a graph-based text representation method that captures both frequency and local context information from text n-grams (as opposed to the frequency-only statistics that bag models aggregate). This enables NGGs to differentiate between morphologically similar but semantically different words, since the information kept is not only the specific n-gram but also its context (neighboring n-grams). The graph is constructed with n-grams as nodes and local co-occurrence information embedded in the edge weights, with comparisons defined via graph-based similarity measures [9]. NGGs can operate with word or character n-grams; in this work we employ the latter version, which is known to be resilient to social media text noise [20,11].

During training, we construct a representative category graph (RCG) for each category in the problem (e.g. "Hate Speech" or "Clean"), aggregating all training instances per category into a single NGG. We then compare the NGG of each instance to each RCG, extracting a score expressing the degree to which the instance belongs to that class; for this, we use the NVS measure [9], which produces a similarity score between the instance and category NGGs. After this process completes, we end up with similarity-based, n-dimensional model vector features for each instance, where n is the number of possible classes. We note that we use 90% of the training instances to build the RCGs, in order to avoid overfitting of our model: in short, using all training instances would result in very high instance-RCG similarities during training. Since we use the resulting model vectors as inputs to a classification phase in the next step, the above approach would introduce extreme overfit to the classifier, biasing it towards expecting perfect similarity scores whenever an instance belongs to a class, a scenario which rarely, if ever, happens with real world data.
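A simplified sketch of this model-vector construction follows; the window-based graph construction mirrors the description above, but a basic value similarity is used here as a stand-in for the actual NVS measure of [9], and the two single-tweet RCGs are purely illustrative.

from collections import defaultdict

def ngram_graph(text, n=3, window=3):
    # nodes are character n-grams; edge weights count co-occurrences of
    # n-grams whose positions fall within the given window
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    graph = defaultdict(float)
    for i, g in enumerate(grams):
        for j in range(i + 1, min(i + 1 + window, len(grams))):
            graph[(g, grams[j])] += 1.0
    return graph

def merge(rcg, graph, weight):
    # running update of edge weights, aggregating instances into a class RCG
    for edge, w in graph.items():
        rcg[edge] += (w - rcg[edge]) * weight
    return rcg

def value_similarity(g1, g2):
    # overlapping edges, weighted by the agreement of their edge weights
    common = set(g1) & set(g2)
    vs = sum(min(g1[e], g2[e]) / max(g1[e], g2[e]) for e in common)
    return vs / max(len(g1), len(g2), 1)

# model vector of a new instance: one similarity score per class RCG
rcgs = {"hate": defaultdict(float), "clean": defaultdict(float)}
merge(rcgs["hate"], ngram_graph("you are all vermin"), 1.0)
merge(rcgs["clean"], ngram_graph("have a great day"), 1.0)
instance = ngram_graph("what a great day")
model_vector = [value_similarity(instance, rcg) for rcg in rcgs.values()]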
In addition, we produce sentiment, syntax and spelling features. Sentiment analysis can provide a meaningful feature, since hatred is related to negative polarity. For sentiment and syntax feature extraction we use the Stanford NLP Parser (https://nlp.stanford.edu/software/lex-parser.html). This tool extracts the sentiment of the longest phrase tracked in the input, and can additionally provide a syntactic score from the syntax trees, corresponding to the best attained score for the entire tweet.

Finally, a spelling feature was constructed to examine whether Hate Speech is correlated with the user's proficiency in writing. We used an English dictionary to collect all correctly spelled English words and, then, for each word in a tweet, we calculated its edit distance from each word in the dictionary, keeping the smallest value (i.e. the distance from the best match). The final feature kept was the average edit distance over the entire post, with its value being close to 0 for tweets in which the majority of words are correctly spelled. At the end of this process, we obtain a 3-dimensional vector, with coordinates corresponding to the sentiment, syntax and spelling scores of the text.

The generated features are fed to a classifier that decides on the presence of Hate Speech content. We use a variety of classification models, as outlined below.

Naive Bayes (NB) [23] is a simple probabilistic classifier, based on Bayesian statistics. NB makes the strong assumption that instance features are independent from one another, but yields performance comparable to far more complicated classifiers; this is why it commonly serves as a baseline for various machine learning tasks [14]. Additionally, the independence assumption simplifies the learning process, reducing it to learning the attributes separately, which vastly reduces time complexity on large datasets.

Logistic Regression (LR) [17] is another statistical model commonly applied as a baseline in binary classification tasks. It produces a prediction via a linear combination of the input with a set of weights, passed through a logistic function which squeezes scores into the range between 0 and 1, thus producing binary classification labels. Training the model involves discovering optimal values for the weights, usually acquired through a maximum likelihood estimation optimization process.

The K-Nearest Neighbor (KNN) classifier [8] is another popular classification technique. It is a lazy and non-parametric method; no explicit training and generalization is performed prior to a query to the classification system, and no assumption is made about the probability distribution that the data follow. Inference requires a distance measure for comparing two instances, via which the closest neighbors are extracted. The labels of these neighbors determine, through voting, the predicted label of a given instance.

The Random Forest (RF) [15] is an ensemble learning technique used for both classification and regression tasks. It combines multiple decision trees during the training phase by bootstrap-aggregated ensemble learning, aiming to alleviate noise and overfitting by incorporating multiple weak learners. Compared to single decision trees, an RF produces each split from a randomly selected subset of the best predictors.

Artificial Neural Networks (ANNs) are computational graphs inspired by biological nervous systems. They are composed of a large number of highly interconnected neurons, usually organized in layers in a feed-forward directed acyclic graph. Similarly to an LR unit, neurons compute a linear combination of their input (including a bias term) and pass the result through a non-linear activation function. Aggregated into an ANN, each neuron computes a specific feature from its input, as dictated by the values of its weights and bias. ANNs are trained with respect to a loss function, which defines an error gradient by which all parameters of the ANN are shifted. With each optimization step, the model moves towards an optimal parameter configuration. The gradient with respect to all network parameters is computed by the back-propagation method. In our case, we have used an ANN composed of 3 hidden layers with dropout regularization.
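A minimal sketch of such a 3-hidden-layer network with dropout in Keras is given below; the layer width, dropout rate and optimizer are assumptions, as the exact hyperparameters are not reported here.

from keras.models import Sequential
from keras.layers import Dense, Dropout

def build_ann(input_dim, hidden=64, dropout_rate=0.5):
    # three hidden layers with dropout regularization; sigmoid output
    # for the binary Hate Speech decision
    model = Sequential([
        Dense(hidden, activation="relu", input_dim=input_dim),
        Dropout(dropout_rate),
        Dense(hidden, activation="relu"),
        Dropout(dropout_rate),
        Dense(hidden, activation="relu"),
        Dropout(dropout_rate),
        Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model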
5 Experiments

In this section, we present the experimental setting used to answer the following questions:

– Which features have the best performance?
– Does feature combination improve performance?
– Do NGGs have significant / comparable performance to BOW or word embeddings, despite being represented by low-dimensional vectors?
– Are there classifiers performing statistically significantly better than others? Is the selection of features or of classifiers more significant when determining the pipeline for Hate Speech detection?

In the following paragraphs, we elaborate on the datasets utilized, present experimental and statistical significance results, and discuss our findings.
We use the datasets provided by [30] (https://github.com/ZeerakW/hatespeech) and [5] (https://github.com/t-davidson/hate-speech-and-offensive-language). We will refer to the first dataset as RS (racism and sexism detection) and to the second as HSOL (distinguishing Hate Speech from Offensive Language). In both works, the authors perform a multi-class classification task on the corpora. In [30], the goal is to distinguish different kinds of Hate Speech, i.e. racism and sexism, and therefore the possible classes in RS are Racist, Sexist or None. In [5], the annotated classes are Hate Speech, Offensive Language or Clean.

Given the multi-class nature of these datasets, we combine them into a single dataset, keeping only the instances labeled Hate Speech and Clean in the originals. We use the combined (RS + HSOL) dataset to evaluate our model implementations on the binary classification task. Furthermore, we run multi-class experiments on the original datasets for completeness; the results are omitted due to space limitations, but are available upon request.

We perform three stages of experiments. First, we run a preliminary evaluation on each feature separately, to assess its performance. Secondly, we evaluate the performance of concatenated feature vectors, in three different combinations: 1) the features that individually perform best by a significant margin (best), 2) all features (all), and 3) vector-based features (vector), i.e. excluding NGGs. Via the latter two scenarios, we investigate whether NGGs can achieve comparable performance to vector-based features of much higher dimensionality.

Given the imbalance of the dataset used (24463 Hate Speech and 14548 clean samples), we report performance in both macro and micro F-measure. Finally, we evaluate (with statistical significance testing) the performance differences between run components, through a series of ANOVA and Tukey HSD test evaluations.
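The sketch below illustrates this evaluation protocol (micro / macro F-measure per run, followed by one-way ANOVA and Tukey's HSD across configurations) using scipy, scikit-learn and statsmodels; the label arrays and the per-feature score lists are hypothetical stand-ins for the actual run results.

import numpy as np
from scipy.stats import f_oneway
from sklearn.metrics import f1_score
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# micro / macro F-measure of a single (toy) run
y_true = np.array([1, 0, 1, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1])
print(f1_score(y_true, y_pred, average="micro"),
      f1_score(y_true, y_pred, average="macro"))

# micro F-measures of repeated runs, grouped by (hypothetical) feature type
scores = {"bow": [0.76, 0.75, 0.77],
          "glove": [0.77, 0.76, 0.78],
          "ngg": [0.73, 0.74, 0.72]}
print(f_oneway(*scores.values()))        # is the feature choice significant?

values, groups = zip(*[(s, name) for name, vals in scores.items() for s in vals])
print(pairwise_tukeyhsd(np.array(values), np.array(groups)))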
Here we provide the main experimental results of the approaches described in the previous section, presented in micro / macro F-measure scores. More detailed results, including multi-class classification, are omitted due to space limitations but are available upon request.

Firstly, to answer the question on the value of the different feature types, we perform individual runs, which designate BOW, GloVe embeddings and NGGs as the top performers, with the remaining features (namely sentiment, spelling / syntax analysis and n-grams) performing significantly worse. All approaches, however, surpass the baseline performance of a naive majority-class classifier. Among the weaker features, syntax and spelling with NNs scored best in terms of macro F-measure (up to 0.603 with NN classification), with the best word n-gram model reaching 0.627 with the KNN and NN classifiers. The results of the top individually performing features, in terms of micro / macro average F-Measure, are presented in the left half of Table 1.
Bold values represent column-wise maxima, while underlined ones depict maxima in the left column category (e.g. feature type, in this case). "NN ke" and "NN sk" represent the Keras and sklearn neural network implementations, respectively. We can observe that the best performer is BOW with either LR or NNs, followed by word embeddings with NN classification. NGGs have a slightly worse performance, which can be attributed to the severely shorter (2-dimensional) feature vector they utilize; BOW features, on the other hand, are 1000-dimensional vectors, i.e. a 500-fold dimension increase over NGGs, for a performance gain of roughly 9%.

Regarding feature combinations (right half of Table 1), the best combination, involving the NGG, BOW and GloVe features, is, not surprisingly, the top performer, with LR and NN-sklearn obtaining the best performance. The all configuration follows, with NB achieving macro / micro F-scores of 0.795 and 0.792, respectively. This shows that the additional features introduced significant amounts of noise, enough to reduce performance by canceling out any potential information the extra features might have provided. Finally, the vector combination achieves the worst performance: 0.787 and 0.783 in macro / micro F-measure. This is a testament to the added value NGGs contribute to the feature pool, reinforced by the individual scores of the other vector-based approaches.

Apart from the experiments on binary Hate Speech classification on the combined dataset, we have tested our classification models in multi-class classification, using the original RS and HSOL datasets. In RS, our best score in terms of micro F-Measure was achieved with the all combination and the RF classifier, while in HSOL the best performance was obtained with the best feature combination and the LR classifier.

In Table 2 we present ANOVA results with respect to feature extractors and classifiers, under macro and micro F-measure scores. For both metrics, the selection of both features and classifiers is statistically significant with a confidence level greater than 99%.
Table 1. Average micro & macro F-Measure for NGG, BOW and GloVe features (left) and the "best", "vector" and "all" feature combinations (right).

feature  classifier  macroF  microF   |  combo  classifier  macroF  microF
NGG      KNN         0.712   0.736    |  best   KNN         0.810   0.820
NGG      LR          0.712   0.739    |  best   LR          -       0.831
NGG      NB          0.678   0.713    |  best   NB          0.632   0.667
NGG      NN ke       0.718   0.727    |  best   NN ke       0.807   0.819
NGG      NN sk       0.716   0.740    |  best   NN sk       -       0.831
NGG      RF          0.699   0.726    |  best   RF          0.734   0.759
BOW      KNN         0.787   0.763    |  all    KNN         0.497   0.569
BOW      NN sk       -       -        |  all    NN sk       0.669   0.675
BOW      RF          0.731   0.755    |  all    RF          0.727   0.742

In Table 3 we present the results of Tukey's HSD test on micro F-Measure, marking each significantly different group with a letter. In the upper part we present results between feature combination groups ("a" to "d"), where the best combination differs significantly, and by a large margin, from the mutually similar all and vector combinations, as expected. The middle part compares individual features (grouped from "a" to "g"), where GloVe, BoW and NGGs are assigned to neighbouring groups and emerge as the most significant features, with the other approaches trailing them by a large significance margin. Spelling and syntax features are grouped together, as are the two n-gram approaches. Finally, the lower part of the table examines classifier groups ("a" to "c"). Here LR leads the ranking, followed by groups with the ANN approaches, then NB and RF, and finally the KNN method.
Table 2. ANOVA results with respect to feature and classifier selection, in terms of macro F-measure (left) and micro F-measure (right).

parameter     Pr(>F) (macroF)   Pr(>F) (microF)
features      <                 <
classifiers   -                 -

The results and statistical tests of our work showcase the BOW, GloVe embeddings and the NGG model as the top performing feature-related configurations.
BOW and GloVe score best in terms of micro and macro F-measure, respectively, with NGGs close behind, despite the extreme dimensionality reduction incurred by the model vector representation of graph similarities. The combination of the top performing features improves the results over the individual ones, reaching a 0.831 micro F-Measure when employed with an LR classifier or NN-sklearn.
Table 3. Tukey's HSD group test on micro F-Measure between feature combination groups (top), individual features (middle) and classifiers (bottom).

config      micro F-measure   groups
best        0.787             a
all         0.695             cd
vector      0.693             d

glove       0.767             a
BoW         0.755             ab
NGG         0.730             bc
spelling    0.617             e
syntax      0.613             e
c-ngrams    0.574             f
w-ngrams    0.572             f
sentiment   0.500             g

LR          0.689             a
NN ke       0.670             ab
NN sk       0.668             ab
NB          0.661             bc
RF          0.655             bc
KNN         0.639             c
Regarding classification methods, the LR and ANN classifiers perform best when used with our top performing features (separately or combined). Statistical tests show that, in terms of both micro and macro F-Measure, both the representation and the classification approach play a significant role in the performance results.

Finally, we understand from our study that the contribution of NGGs as a text representation is significant. NGGs use no domain-specific knowledge (unlike the BOW vectors, which use HateBase keywords) and require no prior training on large document collections (unlike word embeddings, which need extensive unsupervised pre-training). In addition, the vector dimension of the NGG-based approach is equal to the number of classes, as opposed to the 1000- and 50-dimensional BOW and embedding vectors, respectively. Despite this low-dimensional representation, our empirical evaluation shows that NGGs contribute significantly to detection performance. Therefore, NGGs can be seen as off-the-shelf rich features that encapsulate useful information in a low-dimensional representation, which helps achieve significant performance either when used by themselves or in feature combination approaches.
6 Conclusion

In this study, we investigated different text representation techniques and classification algorithms, performing a large number of experimental evaluations on the Hate Speech detection problem. We showed that n-gram graph-based features constitute deep/rich features, with a significant contribution to the Hate Speech classification results.

In the future, we aim to better evaluate the contribution of word roles (e.g. POS tags) and to combine them with improved preprocessing, to avoid possible noise in the related features. Concerning NGGs in Hate Speech detection, we want to apply the findings from the work of [27] on NGG variations, to represent short texts with only the important n-grams of the text (e.g. through a TF-IDF filtering process and/or a named entity recognizer). The aim is to reduce the complexity and size of the NGGs, while retaining all the useful information. Another avenue of research is the enrichment of deep features with statistical pre-trained models (such as Latent Dirichlet Allocation) or semantic information (e.g. from thesauri) to further improve performance.
References
1. Badjatiya, P., Gupta, S., Gupta, M., Varma, V.: Deep learning for hate speech detection in tweets. In: Proceedings of the 26th International Conference on World Wide Web Companion, pp. 759–760. International World Wide Web Conferences Steering Committee (2017)
2. Bourgonje, P., Moreno-Schneider, J., Srivastava, A., Rehm, G.: Automatic classification of abusive language and personal attacks in various forms of online communication. In: International Conference of the German Society for Computational Linguistics and Language Technology, pp. 180–191. Springer (2017)
3. Brown, A.: What is hate speech? Part 1: The myth of hate. Law and Philosophy (4), 419–468 (2017)
4. Cortes, C., Vapnik, V.: Support-vector networks. Machine Learning (3), 273–297 (1995)
5. Davidson, T., Warmsley, D., Macy, M., Weber, I.: Automated hate speech detection and the problem of offensive language. arXiv preprint arXiv:1703.04009 (2017)
6. Del Vigna, F., Cimino, A., Dell'Orletta, F., Petrocchi, M., Tesconi, M.: Hate me, hate me not: Hate speech detection on Facebook (2017)
7. Djuric, N., Zhou, J., Morris, R., Grbovic, M., Radosavljevic, V., Bhamidipati, N.: Hate speech detection with comment embeddings. In: Proceedings of the 24th International Conference on World Wide Web, pp. 29–30. ACM (2015)
8. Fix, E., Hodges Jr, J.L.: Discriminatory analysis: Nonparametric discrimination: Small sample performance. Tech. rep., California Univ, Berkeley (1952)
9. Giannakopoulos, G.: Automatic Summarization from Multiple Documents. Ph.D. thesis, University of the Aegean (2009)
10. Giannakopoulos, G., Karkaletsis, V., Vouros, G.A.: Testing the use of n-gram graphs in summarization sub-tasks. In: TAC (2008)
11. Giannakopoulos, G., Mavridi, P., Paliouras, G., Papadakis, G., Tserpes, K.: Representation models for text classification: a comparative analysis over three web document types. In: Proceedings of the 2nd International Conference on Web Intelligence, Mining and Semantics, p. 13. ACM (2012)
12. Gitari, N.D., Zuping, Z., Damien, H., Long, J.: A lexicon-based approach for hate speech detection. International Journal of Multimedia and Ubiquitous Engineering (4), 215–230 (2015)
13. Kwok, I., Wang, Y.: Locate the hate: Detecting tweets against blacks. In: AAAI (2013)
14. Lewis, D.D.: Naive (Bayes) at forty: The independence assumption in information retrieval. In: European Conference on Machine Learning, pp. 4–15. Springer (1998)
15. Liaw, A., Wiener, M., et al.: Classification and regression by randomForest. R News (3), 18–22 (2002)
16. McCullagh, P., Nelder, J.A.: Generalized Linear Models, vol. 37. CRC Press (1989)
17. Menard, S.W.: Applied Logistic Regression Analysis. No. 04; e-book (1995)
18. Mubarak, H., Darwish, K., Magdy, W.: Abusive language detection on Arabic social media. In: Proceedings of the First Workshop on Abusive Language Online, pp. 52–56 (2017)
19. Nobata, C., Tetreault, J., Thomas, A., Mehdad, Y., Chang, Y.: Abusive language detection in online user content. In: Proceedings of the 25th International Conference on World Wide Web, pp. 145–153. International World Wide Web Conferences Steering Committee (2016)
20. Papadakis, G., Giannakopoulos, G., Paliouras, G.: Graph vs. bag representation models for the topic classification of web documents. World Wide Web (5), 887–920 (2016)
21. Park, J.H., Fung, P.: One-step and two-step classification for abusive language detection on Twitter. arXiv preprint arXiv:1706.01206 (2017)
22. Pennington, J., Socher, R., Manning, C.: GloVe: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014)
23. Russell, S., Norvig, P.: Artificial Intelligence: A Modern Approach. Prentice-Hall, Englewood Cliffs (1995)