A study of text representations for Hate Speech Detection
Chrysoula Themeli, George Giannakopoulos, Nikiforos Pittaras
Department of Informatics and Telecommunications, National and Kapodistrian University of Athens
{cthemeli, npittaras}@di.uoa.gr
NCSR Demokritos, Athens, Greece
{ggianna, pittarasnikif}@iit.demokritos.gr

⋆ Supported by NCSR Demokritos, and the Department of Informatics and Telecommunications, National and Kapodistrian University of Athens

Abstract.
The pervasiveness of the Internet and social media has enabled the rapid and anonymous spread of Hate Speech content on microblogging platforms such as Twitter. Current EU and US legislation against hateful language, in conjunction with the large amount of data produced on these platforms, has led to automatic tools being a necessary component of the Hate Speech detection task and pipeline. In this study, we examine the performance of several, diverse text representation techniques paired with multiple classification algorithms, on the automatic Hate Speech detection and abusive language discrimination task. We perform an experimental evaluation on binary and multiclass datasets, paired with significance testing. Our results show that simple hate-keyword frequency features (BoW) work best, followed by pre-trained word embeddings (GloVe) as well as N-gram graphs (NGGs): a graph-based representation which proved to produce efficient, very low-dimensional but rich features for this task. A combination of these representations paired with Logistic Regression or 3-layer neural network classifiers achieved the best detection performance, in terms of micro and macro F-measure.
Keywords: hate speech · natural language processing · classification · social media

1 Introduction

Hate Speech is a common affliction in modern society. Nowadays, people can come across Hate Speech content ever more easily through social media platforms, websites and forums containing user-created content. The increased use of social media gives individuals the opportunity to easily spread hateful content and reach a larger audience than ever before. On the other hand, social media platforms like Facebook or Twitter want both to comply with legislation against Hate Speech and to improve user experience. Therefore, they need to track and remove Hate Speech content from their websites efficiently.
Due to the large amount of data transmitted through these platforms, delegating such a task to humans is extremely inefficient. A usual compromise is to rely on user reports, reviewing only the reported posts and comments. This is also ineffective, since it relies on the users' subjectivity and trustworthiness, as well as on their ability to thoroughly track and flag such content. For all of the above reasons, the development of automated tools to detect Hate Speech content is deemed necessary. The goal of this work is: (i) to study different text representations and classification algorithms in the task of Hate Speech detection; (ii) to evaluate whether the n-gram graph representation [10] can constitute a rich/deep feature set (as, e.g., in [20]) for the given task.

The structure of the paper is as follows. In Section 2 we define the Hate Speech detection problem, while in Section 3 we discuss related work. We overview our study approach and elaborate on the proposed method in Section 4. We then experimentally evaluate the performance of different approaches in Section 5, concluding the paper in Section 6 by summarizing the findings and proposing future work.
2 Problem Definition

The first step towards Hate Speech detection is to provide a clear and concise definition of Hate Speech. This is especially important during the manual compilation of Hate Speech detection datasets, where human annotators are involved. In their work, the authors of [13] asked three students of different race and the same age and gender to annotate whether a tweet contained Hate Speech or not, as well as the degree of its offensiveness. The agreement was only 33%, showing that Hate Speech detection can be highly subjective and dependent on the educational and/or cultural background of the annotator. Thus, an unambiguous definition is necessary to eliminate any such personal bias in the annotation process.

Usually, Hate Speech is associated with insults or threats. Following the definition provided by [3], "it covers all forms of expressions that spread, incite, promote or justify racial hatred, xenophobia, antisemitism or other forms of hatred based on intolerance". Moreover, it can be "insulting, degrading, defaming, negatively stereotyping or inciting hatred, discrimination or violence against people in virtue of their race, ethnicity, nationality, religion, sexual orientation, disability, gender identity". However, we cannot disregard that Hate Speech can also be expressed by statements promoting the superiority of one group of people over another, or by expressing stereotypes against a group of people.

The goal of a Hate Speech Detection model is, given an input text T, to output True if T contains Hate Speech and False otherwise. Modeling the task as a binary classification problem, the detector is built by learning from a training set and is subsequently evaluated on unseen data. Specifically, the input is transformed into a machine-readable format via a text representation method, which ideally captures and retains informative characteristics of the input text. The representation data is fed to a machine learning algorithm that assigns the input to one of the two classes, with a certain confidence. During the training phase, this discrimination information is used to construct the classifier. The classifier is then applied to data not encountered during training, in order to measure its generalization ability.

In this study, we focus on user-generated texts from social media platforms, specifically Twitter posts. We evaluate the performance of several established text representations (e.g. Bag of Words, word embeddings) and classification algorithms. We also investigate the contribution of the graph-based n-gram graph features to the Hate Speech classification process. Moreover, we examine whether a combination of deep features (such as n-gram graphs) and shallow features (such as Bag of Words) can provide top performance in the Hate Speech detection task.
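To make the above pipeline concrete, the following is a minimal sketch pairing a simple bag-of-words representation with a Logistic Regression classifier in scikit-learn; the toy tweets and labels are hypothetical placeholders, not taken from the datasets used later in this paper.

# Minimal sketch of the representation -> classification pipeline
# described above; toy data, not our actual implementation.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

train_texts = ["those people are vermin", "what a lovely morning",
               "they should all disappear", "great match yesterday"]
train_labels = [True, False, True, False]   # True = contains Hate Speech
test_texts = ["you are all vermin", "lovely weather today"]
test_labels = [True, False]

# representation: map raw text to machine-readable feature vectors
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

# classification: learn from the training set, evaluate on unseen data
classifier = LogisticRegression()
classifier.fit(X_train, train_labels)
print(f1_score(test_labels, classifier.predict(X_test)))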
3 Related Work

In this section, we provide a short review of the related work, not only on Hate Speech detection, but on similar tasks as well. Examples of such tasks can be found in [18], where the authors aim to identify which users express Hate Speech more often, while [32] detect and delete hateful content within a comment, making sure that what is left has correct syntax. The latter is a demanding task which requires the precise identification of grammatical relations and typed dependencies among the words of a sentence. The results of their proposed method have a 90.94% agreement with the manual filtering results.

Automatic Hate Speech detection is usually modeled as a binary classification task. However, multi-class classification can be applied to identify the specific kind of Hate Speech (e.g. racism, sexism etc.) [1,21]. Another useful task is the detection of the specific words or phrases that are offensive or promote hatred, investigated in [28].
In this work we focus on representations, i.e. the mapping of written human language into a collection of useful features in a form that is understandable by a computer and, by extension, by a Hate Speech Detection model. Below we overview a number of different representations used within this domain.

A very popular representation approach is the Bag of Words (BOW) model [13,2,1], a Vector Space Model extensively used in Natural Language Processing and document classification. In BOW, the text is segmented into words, followed by the construction of a histogram of (possibly weighted) word frequencies. Since BOW discards word order and syntactic, semantic and grammatical information, it is commonly used as a baseline in NLP tasks. An extension of BOW is the Bag of N-grams [19,13,18,5,29], which replaces the unit of interest in BOW from single words to n contiguous tokens. A token is usually a word or a character in the text, giving rise to word n-gram and character n-gram models, respectively. Due to the contiguity consideration, n-gram bags retain local spatial and order information.
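As a short illustration of these units (a toy snippet, not part of any cited system):

# character and word n-grams of a toy text
def char_ngrams(text, n):
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def word_ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(char_ngrams("hate", 2))                       # ['ha', 'at', 'te']
print(word_ngrams("you are all awful".split(), 2))
# [('you', 'are'), ('are', 'all'), ('all', 'awful')]

Because each n-gram spans adjacent tokens, the resulting bag preserves local order information that single-word counts discard.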
The authors in [5] claim that lexicon detection methods alone are inadequate in distinguishing between Hate Speech and Offensive Language, counter-proposing n-gram bags with TF-IDF weighting along with a sentiment lexicon, classified with L2-regularized Logistic Regression [16]. On the other hand, [1] use character n-grams, BOW and TF-IDF features as a baseline, proposing word embeddings from GloVe (https://nlp.stanford.edu/projects/glove/). In [21] the authors use character and word CNNs, as well as a hybrid CNN model, to classify sexist and racist Twitter content. They compare multi-class detection with a coarse-to-fine two-step classification process, achieving similar results with both approaches. A variety of other features has also been used, such as word or paragraph embeddings [7,28,1], LDA and Brown clustering [24,31,28,29], sentiment analysis [12,5], lexicons and dictionaries [12,26,6], and POS tags [19,32,24].

Regarding classification algorithms, SVMs [4], Logistic Regression (LR) and Naive Bayes (NB) are the most widely used (e.g. [28,24,5,7]). In [30] and [31], the authors use a bootstrapping approach to aid the training process via data generation. This approach was used as a semi-supervised learning process to generate additional data automatically or to create hatred lexical resources. The authors of [31] use the Map-Reduce framework in Hadoop to collect tweets automatically from users that are known to use offensive language, and a bootstrapping method to extract topics from tweets. Other algorithms used are Decision Trees and Random Forests (RF) [5,2,31], while [1] and [6] have used Deep Learning approaches via LSTM networks. Specifically, [1] use CNN, LSTM and FastText, i.e. a model represented by average word vectors similar to BOW, which are updated through backpropagation. Their LSTM model achieved the best performance with a 0.93 F-Measure, and was used to train a GBDT (Gradient Boosted Decision Trees) classifier. In [5], the authors use several classification algorithms, such as regularized LR, NB, Decision Trees, RF and linear SVM, with L2-regularized LR outperforming the other approaches in terms of F-score.

For more information, the survey of [25] provides a detailed analysis of detector components used for Hate Speech detection and similar tasks.
4 Approach

In this section we describe the text representation and classification components used in our implementations of a Hate Speech Detection pipeline. We have used a variety of different text representations, i.e. bag of words, embeddings, n-grams and n-gram graphs, and tested these representations with multiple classification algorithms. We have implemented the feature extraction in Java and used both Weka and scikit-learn (sklearn) to implement the classification algorithms. For artificial neural networks (ANNs), we have used the sklearn and Keras frameworks. Our model can be found in our GitHub repository (https://github.com/cthem/hate-speech-detection).

In order to discard noise and useless artifacts, we apply standard preprocessing to each tweet. First, we remove all URLs, mentions (e.g. @username), RT (retweet) markers and hashtags (i.e. words starting with #), as well as stopwords (https://github.com/igorbrigadir/stopwords).

After preprocessing, we apply a variety of representations, starting with the Bag of Words (BOW) model. This representation results in a high-dimensional vector, containing all encountered words, and requires a significant amount of time to process each text. In order to reduce time and space complexity, we limit the number of words of interest to keywords from HateBase [5] (https://github.com/t-davidson/hate-speech-and-offensive-language).

Moreover, we have used additional bag models, with respect to word and character n-grams. In order to guarantee a common bag feature vector dimension across texts, we pre-compute all n-grams that appear in the dataset, resulting in a sparse and high-dimensional vector. Similarly to the BOW features, it is necessary to reduce the vector space in order to reduce time and space complexity. Therefore, we keep only the 100 most frequent n-gram features, discarding the rest. Unfortunately, as we will illustrate in the experiments, this decision resulted in highly sparse vectors and, thus, reduced the effectiveness of those features.

Furthermore, we have used GloVe word embeddings [22] to represent the words of each tweet, mapping each word to a 50-dimensional real vector and arriving at a single tweet vector representation via mean averaging. Words missing from the GloVe mapping were discarded.
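The following is an illustrative Python sketch of this preprocessing and of the vector-based features; the regular expressions, the two-entry keyword list and the toy GloVe lookup are assumptions for illustration, while our actual feature extraction is implemented in Java.

import re
import numpy as np

HATE_KEYWORDS = ["keyword1", "keyword2"]  # stand-in for the HateBase keyword list

def preprocess(tweet):
    # strip URLs, @mentions, RT markers and hashtag words, then lowercase
    tweet = re.sub(r"https?://\S+", " ", tweet)
    tweet = re.sub(r"@\w+", " ", tweet)
    tweet = re.sub(r"\bRT\b", " ", tweet)
    tweet = re.sub(r"#\w+", " ", tweet)
    return tweet.lower().split()

def bow_features(tokens):
    # keyword-restricted BOW: frequency of each keyword in the tweet
    return np.array([tokens.count(k) for k in HATE_KEYWORDS], dtype=float)

def glove_features(tokens, glove, dim=50):
    # mean of the 50-dimensional GloVe vectors of in-vocabulary words
    vectors = [glove[t] for t in tokens if t in glove]
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim)

glove = {"hate": np.random.rand(50), "mondays": np.random.rand(50)}  # toy lookup
tokens = preprocess("RT @user I hate mondays #rant https://t.co/xyz")
features = np.concatenate([bow_features(tokens), glove_features(tokens, glove)])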
Expanding the use of n-grams, we examine whether n-gram graphs (NGGs) [9,11] can make a significant contribution to detecting Hate Speech. NGGs are a graph-based text representation method that captures both frequency and local context information from text n-grams (as opposed to the frequency-only statistics that bag models aggregate). This enables NGGs to differentiate between morphologically similar but semantically different words, since the information kept is not only the specific n-gram but also its context (neighboring n-grams). The graph is constructed with n-grams as nodes and local co-occurrence information embedded in the edge weights, with comparisons defined via graph-based similarity measures [9]. NGGs can operate with word or character n-grams; in this work we employ the latter version, which is known to be resilient to social media text noise [20,11].

During training, we construct a representative category graph (RCG) for each category in the problem (e.g. "Hate Speech" or "Clean"), aggregating all training instances per category into a single NGG. We then compare the NGG of each instance to each RCG, extracting a score expressing the degree to which the instance belongs to that class; for this, we use the NVS measure [9], which produces a similarity score between the instance and category NGGs. After this process completes, we end up with similarity-based, n-dimensional model vector features for each instance, where n is the number of possible classes. We note that we use 90% of the training instances to build the RCGs, in order to avoid overfitting of our model: in short, using all training instances would result in very high instance-RCG similarities during training. Since we use the resulting model vectors as inputs to a classification phase in the next step, the above approach would introduce extreme overfit to the classifier, biasing it towards expecting perfect similarity scores whenever an instance belongs to a class, a scenario which rarely, if ever, happens with real world data.
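A simplified sketch of this model-vector construction follows; the window-based graph construction mirrors the description above, but a basic value similarity is used here as a stand-in for the actual NVS measure of [9], and the two single-tweet RCGs are purely illustrative.

from collections import defaultdict

def ngram_graph(text, n=3, window=3):
    # nodes are character n-grams; edge weights count co-occurrences of
    # n-grams whose positions fall within the given window
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    graph = defaultdict(float)
    for i, g in enumerate(grams):
        for j in range(i + 1, min(i + 1 + window, len(grams))):
            graph[(g, grams[j])] += 1.0
    return graph

def merge(rcg, graph, weight):
    # running update of edge weights, aggregating instances into a class RCG
    for edge, w in graph.items():
        rcg[edge] += (w - rcg[edge]) * weight
    return rcg

def value_similarity(g1, g2):
    # overlapping edges, weighted by the agreement of their edge weights
    common = set(g1) & set(g2)
    vs = sum(min(g1[e], g2[e]) / max(g1[e], g2[e]) for e in common)
    return vs / max(len(g1), len(g2), 1)

# model vector of a new instance: one similarity score per class RCG
rcgs = {"hate": defaultdict(float), "clean": defaultdict(float)}
merge(rcgs["hate"], ngram_graph("you are all vermin"), 1.0)
merge(rcgs["clean"], ngram_graph("have a great day"), 1.0)
instance = ngram_graph("what a great day")
model_vector = [value_similarity(instance, rcg) for rcg in rcgs.values()]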
In addition, we produce sentiment, syntax and spelling features. Sentiment analysis can provide a meaningful feature, since hatred is related to negative polarity. For sentiment and syntax feature extraction we use the Stanford NLP Parser (https://nlp.stanford.edu/software/lex-parser.html). This tool extracts the sentiment of the longest phrase tracked in the input, and can additionally provide a syntactic score from the syntax trees, corresponding to the best attained score for the entire tweet.

Finally, a spelling feature was constructed to examine whether Hate Speech is correlated with the user's proficiency in writing. We used an English dictionary to collect all correctly spelled English words and, then, for each word in a tweet, we calculated its edit distance from each word in the dictionary, keeping the smallest value (i.e. the distance from the best match). The final feature kept was the average edit distance over the entire post, with its value being close to 0 for tweets in which the majority of words are correctly spelled. At the end of this process, we obtain a 3-dimensional vector, with coordinates corresponding to the sentiment, syntax and spelling scores of the text.

The generated features are fed to a classifier that decides on the presence of Hate Speech content. We use a variety of classification models, as outlined below.

Naive Bayes (NB) [23] is a simple probabilistic classifier, based on Bayesian statistics. NB makes the strong assumption that instance features are independent from one another, but yields performance comparable to far more complicated classifiers; this is why it commonly serves as a baseline for various machine learning tasks [14]. Additionally, the independence assumption simplifies the learning process, reducing it to learning the attributes separately, which vastly reduces time complexity on large datasets.

Logistic Regression (LR) [17] is another statistical model commonly applied as a baseline in binary classification tasks. It produces a prediction via a linear combination of the input with a set of weights, passed through a logistic function which squeezes scores into the range between 0 and 1, thus producing binary classification labels. Training the model involves discovering optimal values for the weights, usually acquired through a maximum likelihood estimation optimization process.

The K-Nearest Neighbor (KNN) classifier [8] is another popular classification technique. It is a lazy and non-parametric method; no explicit training and generalization is performed prior to a query to the classification system, and no assumption is made about the probability distribution that the data follow. Inference requires a distance measure for comparing two instances, via which the closest neighbors are extracted. The labels of these neighbors determine, through voting, the predicted label of a given instance.

The Random Forest (RF) [15] is an ensemble learning technique used for both classification and regression tasks. It combines multiple decision trees during the training phase by bootstrap-aggregated ensemble learning, aiming to alleviate noise and overfitting by incorporating multiple weak learners. Compared to single decision trees, an RF produces each split from a randomly selected subset of the best predictors.

Artificial Neural Networks (ANNs) are computational graphs inspired by biological nervous systems. They are composed of a large number of highly interconnected neurons, usually organized in layers in a feed-forward directed acyclic graph. Similarly to an LR unit, neurons compute a linear combination of their input (including a bias term) and pass the result through a non-linear activation function. Aggregated into an ANN, each neuron computes a specific feature from its input, as dictated by the values of its weights and bias. ANNs are trained with respect to a loss function, which defines an error gradient by which all parameters of the ANN are shifted. With each optimization step, the model moves towards an optimal parameter configuration. The gradient with respect to all network parameters is computed by the back-propagation method. In our case, we have used an ANN composed of 3 hidden layers with dropout regularization.
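A minimal sketch of such a 3-hidden-layer network with dropout in Keras is given below; the layer width, dropout rate and optimizer are assumptions, as the exact hyperparameters are not reported here.

from keras.models import Sequential
from keras.layers import Dense, Dropout

def build_ann(input_dim, hidden=64, dropout_rate=0.5):
    # three hidden layers with dropout regularization; sigmoid output
    # for the binary Hate Speech decision
    model = Sequential([
        Dense(hidden, activation="relu", input_dim=input_dim),
        Dropout(dropout_rate),
        Dense(hidden, activation="relu"),
        Dropout(dropout_rate),
        Dense(hidden, activation="relu"),
        Dropout(dropout_rate),
        Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model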
5 Experiments

In this section, we present the experimental setting used to answer the following questions:

– Which features have the best performance?
– Does feature combination improve performance?
– Do NGGs have significant / comparable performance to BOW or word embeddings, despite being represented by low-dimensional vectors?
– Are there classifiers performing statistically significantly better than others? Is the selection of features or of classifiers more significant when determining the pipeline for Hate Speech detection?

In the following paragraphs, we elaborate on the datasets utilized, present experimental and statistical significance results, and discuss our findings.
We use the datasets provided by [30] (https://github.com/ZeerakW/hatespeech) and [5] (https://github.com/t-davidson/hate-speech-and-offensive-language). We will refer to the first dataset as RS (racism and sexism detection) and to the second as HSOL (distinguishing Hate Speech from Offensive Language). In both works, the authors perform a multi-class classification task on the corpora. In [30], the goal is to distinguish different kinds of Hate Speech, i.e. racism and sexism, and therefore the possible classes in RS are Racist, Sexist or None. In [5], the annotated classes are Hate Speech, Offensive Language or Clean.

Given the multi-class nature of these datasets, we combine them into a single dataset, keeping only the instances labeled Hate Speech and Clean in the originals. We use the combined (RS + HSOL) dataset to evaluate our model implementations on the binary classification task. Furthermore, we run multi-class experiments on the original datasets for completeness; the results are omitted due to space limitations, but are available upon request.

We perform three stages of experiments. First, we run a preliminary evaluation on each feature separately, to assess its performance. Secondly, we evaluate the performance of concatenated feature vectors, in three different combinations: 1) the features that individually perform best by a significant margin (best), 2) all features (all), and 3) vector-based features (vector), i.e. excluding NGGs. Via the latter two scenarios, we investigate whether NGGs can achieve comparable performance to vector-based features of much higher dimensionality.

Given the imbalance of the dataset used (24463 Hate Speech and 14548 clean samples), we report performance in both macro and micro F-measure. Finally, we evaluate (with statistical significance testing) the performance differences between run components, through a series of ANOVA and Tukey HSD test evaluations.
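The sketch below illustrates this evaluation protocol (micro / macro F-measure per run, followed by one-way ANOVA and Tukey's HSD across configurations) using scipy, scikit-learn and statsmodels; the label arrays and the per-feature score lists are hypothetical stand-ins for the actual run results.

import numpy as np
from scipy.stats import f_oneway
from sklearn.metrics import f1_score
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# micro / macro F-measure of a single (toy) run
y_true = np.array([1, 0, 1, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1])
print(f1_score(y_true, y_pred, average="micro"),
      f1_score(y_true, y_pred, average="macro"))

# micro F-measures of repeated runs, grouped by (hypothetical) feature type
scores = {"bow": [0.76, 0.75, 0.77],
          "glove": [0.77, 0.76, 0.78],
          "ngg": [0.73, 0.74, 0.72]}
print(f_oneway(*scores.values()))        # is the feature choice significant?

values, groups = zip(*[(s, name) for name, vals in scores.items() for s in vals])
print(pairwise_tukeyhsd(np.array(values), np.array(groups)))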
Here we provide the main experimental results of the approaches described in the previous section, presented in micro / macro F-measure scores. More detailed results, including multi-class classification, are omitted due to space limitations but are available upon request.

Firstly, to answer the question on the value of the different feature types, we perform individual runs, which designate BOW, GloVe embeddings and NGGs as the top performers, with the remaining features (namely sentiment, spelling / syntax analysis and n-grams) performing significantly worse. All approaches, however, surpass the baseline performance of a naive majority-class classifier. Among the weaker features, syntax and spelling with NNs scored best in terms of macro F-measure (up to 0.603 with NN classification), with the best word n-gram model reaching 0.627 with the KNN and NN classifiers. The results of the top individually performing features, in terms of micro / macro average F-Measure, are presented in the left half of Table 1.
Bold values represent column-wise maxima, while underlined ones depict maxima in the left column category (e.g. feature type, in this case). "NN ke" and "NN sk" represent the Keras and sklearn neural network implementations, respectively. We can observe that the best performer is BOW with either LR or NNs, followed by word embeddings with NN classification. NGGs have a slightly worse performance, which can be attributed to the severely shorter (2-dimensional) feature vector they utilize; BOW features, on the other hand, are 1000-dimensional vectors, i.e. a 500-fold dimension increase over NGGs, for a performance gain of roughly 9%.

Regarding feature combinations (right half of Table 1), the best combination, involving the NGG, BOW and GloVe features, is, not surprisingly, the top performer, with LR and NN-sklearn obtaining the best performance. The all configuration follows, with NB achieving macro / micro F-scores of 0.795 and 0.792, respectively. This shows that the additional features introduced significant amounts of noise, enough to reduce performance by canceling out any potential information the extra features might have provided. Finally, the vector combination achieves the worst performance: 0.787 and 0.783 in macro / micro F-measure. This is a testament to the added value NGGs contribute to the feature pool, reinforced by the individual scores of the other vector-based approaches.

Apart from the experiments on binary Hate Speech classification on the combined dataset, we have tested our classification models in multi-class classification, using the original RS and HSOL datasets. In RS, our best score in terms of micro F-Measure was achieved with the all combination and the RF classifier, while in HSOL the best performance was obtained with the best feature combination and the LR classifier.

In Table 2 we present ANOVA results with respect to feature extractors and classifiers, under macro and micro F-measure scores. For both metrics, the selection of both features and classifiers is statistically significant with a confidence level greater than 99%.
Table 1. Average micro & macro F-Measure for NGG, BOW and GloVe features (left) and the "best", "vector" and "all" feature combinations (right).

feature  classifier  macroF  microF   |  combo  classifier  macroF  microF
NGG      KNN         0.712   0.736    |  best   KNN         0.810   0.820
NGG      LR          0.712   0.739    |  best   LR          -       0.831
NGG      NB          0.678   0.713    |  best   NB          0.632   0.667
NGG      NN ke       0.718   0.727    |  best   NN ke       0.807   0.819
NGG      NN sk       0.716   0.740    |  best   NN sk       -       0.831
NGG      RF          0.699   0.726    |  best   RF          0.734   0.759
BOW      KNN         0.787   0.763    |  all    KNN         0.497   0.569
BOW      NN sk       -       -        |  all    NN sk       0.669   0.675
BOW      RF          0.731   0.755    |  all    RF          0.727   0.742

In Table 3 we present the results of Tukey's HSD test on micro F-Measure, marking each significantly different group with a letter. In the upper part we present results between feature combination groups ("a" to "d"), where the best combination differs significantly, and by a large margin, from the mutually similar all and vector combinations, as expected. The middle part compares individual features (grouped from "a" to "g"), where GloVe, BoW and NGGs are assigned to neighbouring groups and emerge as the most significant features, with the other approaches trailing them by a large significance margin. Spelling and syntax features are grouped together, as are the two n-gram approaches. Finally, the lower part of the table examines classifier groups ("a" to "c"). Here LR leads the ranking, followed by groups with the ANN approaches, then NB and RF, and finally the KNN method.
Table 2. ANOVA results with respect to feature and classifier selection, in terms of macro F-measure (left) and micro F-measure (right).

parameter     Pr(>F) (macroF)   Pr(>F) (microF)
features      <                 <
classifiers   -                 -

The results and statistical tests of our work showcase the BOW, GloVe embeddings and the NGG model as the top performing feature-related configurations.
BOW and GloVe score best in terms of micro and macro F-measure, respectively, with NGGs close behind, despite the extreme dimensionality reduction incurred by the model vector representation of graph similarities. The combination of the top performing features improves the results over the individual ones, reaching a 0.831 micro F-Measure when employed with an LR classifier or NN-sklearn.
Table 3. Tukey's HSD group test on micro F-Measure between feature combination groups (top), individual features (middle) and classifiers (bottom).

config      micro F-measure   groups
best        0.787             a
all         0.695             cd
vector      0.693             d

glove       0.767             a
BoW         0.755             ab
NGG         0.730             bc
spelling    0.617             e
syntax      0.613             e
c-ngrams    0.574             f
w-ngrams    0.572             f
sentiment   0.500             g

LR          0.689             a
NN ke       0.670             ab
NN sk       0.668             ab
NB          0.661             bc
RF          0.655             bc
KNN         0.639             c
Regarding classification methods, the LR and ANN classifiers perform best when used with our top performing features (separately or combined). Statistical tests show that, in terms of both micro and macro F-Measure, both the representation and the classification approach play a significant role in the performance results.

Finally, we understand from our study that the contribution of NGGs as a text representation is significant. NGGs use no domain-specific knowledge (unlike the BOW vectors, which use HateBase keywords) and require no prior training on large document collections (unlike word embeddings, which need extensive unsupervised pre-training). In addition, the vector dimension of the NGG-based approach is equal to the number of classes, as opposed to the 1000- and 50-dimensional BOW and embedding vectors, respectively. Despite this low-dimensional representation, our empirical evaluation shows that NGGs contribute significantly to detection performance. Therefore, NGGs can be seen as off-the-shelf rich features that encapsulate useful information in a low-dimensional representation, which helps achieve significant performance either when used by themselves or in feature combination approaches.
6 Conclusion

In this study, we investigated different text representation techniques and classification algorithms, performing a large number of experimental evaluations on the Hate Speech detection problem. We showed that n-gram graph-based features constitute deep/rich features, with a significant contribution to the Hate Speech classification results.

In the future, we aim to better evaluate the contribution of word roles (e.g. POS tags) and to combine them with improved preprocessing, to avoid possible noise in the related features. Concerning NGGs in Hate Speech detection, we want to apply the findings from the work of [27] on NGG variations, to represent short texts with only the important n-grams of the text (e.g. through a TF-IDF filtering process and/or a named entity recognizer). The aim is to reduce the complexity and size of the NGGs, while retaining all the useful information. Another avenue of research is the enrichment of deep features with statistical pre-trained models (such as Latent Dirichlet Allocation) or semantic information (e.g. from thesauri) to further improve performance.
References
1. Badjatiya, P., Gupta, S., Gupta, M., Varma, V.: Deep learning for hate speech detection in tweets. In: Proceedings of the 26th International Conference on World Wide Web Companion, pp. 759–760. International World Wide Web Conferences Steering Committee (2017)
2. Bourgonje, P., Moreno-Schneider, J., Srivastava, A., Rehm, G.: Automatic classification of abusive language and personal attacks in various forms of online communication. In: International Conference of the German Society for Computational Linguistics and Language Technology, pp. 180–191. Springer (2017)
3. Brown, A.: What is hate speech? Part 1: The myth of hate. Law and Philosophy (4), 419–468 (2017)
4. Cortes, C., Vapnik, V.: Support-vector networks. Machine Learning (3), 273–297 (1995)
5. Davidson, T., Warmsley, D., Macy, M., Weber, I.: Automated hate speech detection and the problem of offensive language. arXiv preprint arXiv:1703.04009 (2017)
6. Del Vigna, F., Cimino, A., Dell'Orletta, F., Petrocchi, M., Tesconi, M.: Hate me, hate me not: Hate speech detection on Facebook (2017)
7. Djuric, N., Zhou, J., Morris, R., Grbovic, M., Radosavljevic, V., Bhamidipati, N.: Hate speech detection with comment embeddings. In: Proceedings of the 24th International Conference on World Wide Web, pp. 29–30. ACM (2015)
8. Fix, E., Hodges Jr, J.L.: Discriminatory analysis: Nonparametric discrimination: Small sample performance. Tech. rep., California Univ, Berkeley (1952)
9. Giannakopoulos, G.: Automatic Summarization from Multiple Documents. Ph.D. thesis, University of the Aegean (2009)
10. Giannakopoulos, G., Karkaletsis, V., Vouros, G.A.: Testing the use of n-gram graphs in summarization sub-tasks. In: TAC (2008)
11. Giannakopoulos, G., Mavridi, P., Paliouras, G., Papadakis, G., Tserpes, K.: Representation models for text classification: a comparative analysis over three web document types. In: Proceedings of the 2nd International Conference on Web Intelligence, Mining and Semantics, p. 13. ACM (2012)
12. Gitari, N.D., Zuping, Z., Damien, H., Long, J.: A lexicon-based approach for hate speech detection. International Journal of Multimedia and Ubiquitous Engineering (4), 215–230 (2015)
13. Kwok, I., Wang, Y.: Locate the hate: Detecting tweets against blacks. In: AAAI (2013)
14. Lewis, D.D.: Naive (Bayes) at forty: The independence assumption in information retrieval. In: European Conference on Machine Learning, pp. 4–15. Springer (1998)
15. Liaw, A., Wiener, M., et al.: Classification and regression by randomForest. R News (3), 18–22 (2002)
16. McCullagh, P., Nelder, J.A.: Generalized Linear Models, vol. 37. CRC Press (1989)
17. Menard, S.W.: Applied Logistic Regression Analysis. No. 04; e-book (1995)
18. Mubarak, H., Darwish, K., Magdy, W.: Abusive language detection on Arabic social media. In: Proceedings of the First Workshop on Abusive Language Online, pp. 52–56 (2017)
19. Nobata, C., Tetreault, J., Thomas, A., Mehdad, Y., Chang, Y.: Abusive language detection in online user content. In: Proceedings of the 25th International Conference on World Wide Web, pp. 145–153. International World Wide Web Conferences Steering Committee (2016)
20. Papadakis, G., Giannakopoulos, G., Paliouras, G.: Graph vs. bag representation models for the topic classification of web documents. World Wide Web (5), 887–920 (2016)
21. Park, J.H., Fung, P.: One-step and two-step classification for abusive language detection on Twitter. arXiv preprint arXiv:1706.01206 (2017)
22. Pennington, J., Socher, R., Manning, C.: GloVe: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014)
23. Russell, S., Norvig, P.: Artificial Intelligence: A Modern Approach. Prentice-Hall, Englewood Cliffs (1995)