Neural Character-based Composition Models for Abuse Detection
Pushkar Mishra
Dept. of CS & Technology, University of Cambridge, United Kingdom. [email protected]
Helen Yannakoudakis
The ALTA Institute, University of Cambridge, United Kingdom. [email protected]
Ekaterina Shutova
ILLC, University of Amsterdam, The Netherlands. [email protected]
Abstract
The advent of social media in recent years has fed into some highly undesirable phenomena such as the proliferation of offensive language, hate speech, sexist remarks, etc. on the Internet. In light of this, there have been several efforts to automate the detection and moderation of such abusive content. However, deliberate obfuscation of words by users to evade detection poses a serious challenge to the effectiveness of these efforts. The current state-of-the-art approaches to abusive language detection, based on recurrent neural networks, do not explicitly address this problem and resort to a generic OOV (out of vocabulary) embedding for unseen words. However, in using a single embedding for all unseen words, we lose the ability to distinguish between obfuscated and non-obfuscated or rare words. In this paper, we address this problem by designing a model that can compose embeddings for unseen words. We experimentally demonstrate that our approach significantly advances the current state of the art in abuse detection on datasets from two different domains, namely Twitter and Wikipedia talk pages.
Introduction

Pew Research Center has recently uncovered several disturbing trends in communications on the Internet. As per their report (Duggan, 2014), 40% of adult Internet users have personally experienced harassment online, and 60% have witnessed the use of offensive names and expletives. Expectedly, the majority (66%) of those who have personally faced harassment have had their most recent incident occur on a social networking website or app. While most of these websites and apps provide ways of flagging offensive and hateful content, only 8.8% of the victims have actually considered using such provisions.

Two conclusions can be drawn from these statistics: (i) abuse (a term we use henceforth to collectively refer to toxic language, hate speech, etc.) is prevalent in social media, and (ii) passive and/or manual techniques for curbing its propagation (such as flagging) are neither effective nor easily scalable (Pavlopoulos et al., 2017). Consequently, efforts to automate the detection and moderation of such content have been gaining popularity (Waseem and Hovy, 2016; Wulczyn et al., 2017). In their work, Nobata et al. (2016) describe the task of achieving effective automation as an inherently difficult one due to several ingrained complexities; a prominent one they highlight is the deliberate structural obfuscation of words (for example, fcukk, w0m3n, banislam, etc.) by users to evade detection. Simple spelling correction techniques and edit-distance procedures fail to provide information about such obfuscations because: (i) words may be excessively fudged (e.g., a55h0le, n1gg3r) or concatenated (e.g., stupidbitch, feminismishate), and (ii) they fail to take into account the fact that some character sequences, like musl and wom, are more frequent and more indicative of abuse than others (Waseem and Hovy, 2016).

Nobata et al. (2016) go on to show that simple character n-gram features prove to be highly promising for supervised classification approaches to abuse detection due to their robustness to spelling variations; however, they do not address obfuscations explicitly. Waseem and Hovy (2016) and Wulczyn et al. (2017) also use character n-grams to attain impressive results on their respective datasets. That said, the current state-of-the-art methods do not exploit character-level information, but instead utilize recurrent neural network (RNN) models operating on word embeddings alone (Pavlopoulos et al., 2017; Badjatiya et al., 2017). Since the problem of deliberately noisy input is not explicitly accounted for, these approaches resort to the use of a generic OOV (out of vocabulary) embedding for words not seen in the training phase. However, in using a single embedding for all unseen words, such approaches lose the ability to distinguish obfuscated words from non-obfuscated or rare ones. Recently, Mishra et al. (2018) and Qian et al. (2018), working with the same Twitter dataset as we do, reported that many of the misclassifications by their RNN-based methods happen due to intentional misspellings and/or rare words.

Our contributions are two-fold: first, we experimentally demonstrate that character n-gram features are complementary to the current state-of-the-art RNN approaches to abusive language detection and can strengthen their performance. We then explicitly address the problem of deliberately noisy input by constructing a model that operates at the character level and learns to predict embeddings for unseen words. We show that the integration of this model with the character-enhanced RNN methods further advances the state of the art in abuse detection on three datasets from two different domains, namely Twitter and Wikipedia talk pages. To the best of our knowledge, this is the first work to use character-based word composition models for abuse detection.
Related Work

Yin et al. (2009) were among the first to apply supervised learning to the task of abuse detection. They worked with a linear support vector machine trained on local (e.g., n-grams), contextual (e.g., similarity of a post to its neighboring posts), and sentiment-based (e.g., presence of expletives) features to recognize posts involving harassment.

Djuric et al. (2015) worked with comments taken from the Yahoo Finance portal and demonstrated that distributional representations of comments learned using the paragraph2vec framework (Le and Mikolov, 2014) can outperform simpler bag-of-words (BOW) features under supervised classification settings for hate speech detection. Nobata et al. (2016) improved upon the results of Djuric et al. by training their classifier on an amalgamation of features derived from four different categories: linguistic (e.g., count of insult words), syntactic (e.g., part-of-speech (POS) tags), distributional semantic (e.g., word and comment embeddings), and n-gram based (e.g., word bi-grams). They noted that while the best results were obtained with all features combined, character n-grams had the highest impact on performance.

Waseem and Hovy (2016) utilized a logistic regression (LR) classifier to distinguish amongst racist, sexist, and clean tweets in a dataset of approximately 16k tweets. They found that character n-grams coupled with gender information of users formed the optimal feature set for the task; on the other hand, geographic and word-length distribution features provided little to no improvement. Experimenting with the same dataset, Badjatiya et al. (2017) improved on their results by training a gradient-boosted decision tree (GBDT) classifier on averaged word embeddings learnt using a long short-term memory (LSTM) model initialized with random embeddings. Mishra et al. (2018) went on to incorporate community-based profiling features of users in their classification methods, which led to state-of-the-art performance on this dataset.

Waseem (2016) studied the influence of annotators' knowledge on the task of hate speech detection. For this, they sampled tweets from the same corpus as Waseem and Hovy (2016) and recruited expert and amateur annotators to annotate the tweets as racism, sexism, both or neither. Combining this dataset with that of Waseem and Hovy (2016), Park and Fung (2017) evaluated the efficacy of a 2-step classification process: they first used an LR classifier to separate abusive and non-abusive tweets, and then used another LR classifier to distinguish between the racist and sexist ones. They showed that this setup had comparable performance to a 1-step classification approach based on convolutional neural networks (CNNs) operating on word and character embeddings.

Wulczyn et al. (2017) created three different datasets of comments collected from the English Wikipedia Talk pages: one was annotated for personal attacks, another for toxicity, and the third for aggression. They achieved their best results with a multi-layered perceptron classifier trained on character n-gram features. Working with the personal attack and toxicity datasets, Pavlopoulos et al. (2017) outperformed the methods of Wulczyn et al. by using a gated recurrent unit (GRU) to model the comments as dense low-dimensional representations, followed by an LR layer to classify the comments based on those representations.

Davidson et al. (2017) produced a dataset of about 25k racist, offensive or clean tweets. They evaluated several multi-class classifiers with the aim of discerning clean tweets from racist and offensive tweets, while simultaneously being able to distinguish between the racist and offensive ones. Their best model was an LR classifier trained using TF–IDF and POS n-gram features coupled with features like the count of hashtags and number of words.
Datasets

Following the proceedings of the 1st Workshop on Abusive Language Online (Waseem et al., 2017), we use three datasets from two different domains.
Waseem and Hovy (2016) prepared a dataset of 16,914 tweets from a corpus of approximately 136k tweets retrieved over a period of two months. They bootstrapped their collection process with a search for commonly used slurs and expletives related to religious, sexual, gender and ethnic minorities. After manually annotating the tweets as racism, sexism or neither, they asked an expert to review their annotations in order to mitigate against any biases. The inter-annotator agreement was reported at κ = 0.84, with the further insight that 85% of all the disagreements occurred in the sexism class alone.

The authors released the dataset as a list of tweet IDs and their corresponding annotations. We could only retrieve 16,202 of the tweets with Python's Tweepy library, since some of them have been deleted or their visibility has been limited. Of the ones retrieved, 1,939 (12%) are racism, 3,148 (19.4%) are sexism, and the remaining 11,115 (68.6%) are neither; the original dataset has a similar distribution, i.e., 11.7% racism, 20.0% sexism, and 68.3% neither.

Wulczyn et al. (2017) extracted approximately 63M talk page comments from a public dump of the full history of English Wikipedia released in January 2016. From this corpus, they randomly sampled comments to form three datasets on personal attack, toxicity and aggression, and engaged workers from CrowdFlower to annotate them. Noting that the datasets were highly skewed towards the non-abusive classes, the authors oversampled comments from banned users to attain a more uniform distribution.

In this work, we utilize the toxicity and personal attack datasets, henceforth referred to as W-TOX and W-ATT respectively. Each comment in both of these datasets was annotated by at least 10 workers. We use the majority annotation of each comment to resolve its gold label: if a comment is deemed toxic (alternatively, attacking) by more than half of the annotators, we label it as abusive; otherwise, as non-abusive. 13,590 (11.7%) of the 115,864 comments in W-ATT and 15,362 (9.6%) of the 159,686 comments in W-TOX are abusive. Wikipedia comments, with an average length of 25 tokens, are considerably longer than the tweets, which have an average length of 8.
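As a small illustration of the gold-label resolution described above, the following sketch derives the abusive/non-abusive label from per-comment annotator votes; the annotation values here are toy data, not the released datasets.

```python
# Resolving gold labels by majority vote, as described above: a comment is
# 'abusive' if more than half of its (>= 10) annotators deem it
# toxic/attacking. The annotations below are toy values.
annotations = {
    "comment_1": [1, 1, 1, 0, 1, 1, 0, 1, 1, 1],  # 1 = toxic/attacking
    "comment_2": [0, 0, 1, 0, 0, 0, 1, 0, 0, 0],
}

gold = {
    cid: "abusive" if sum(votes) > len(votes) / 2 else "non-abusive"
    for cid, votes in annotations.items()
}
print(gold)  # {'comment_1': 'abusive', 'comment_2': 'non-abusive'}
```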
Methods

We experiment with ten different methods, eight of which have an RNN operating on word embeddings. Six of these eight also include character n-gram features, and four further integrate our word composition model. The remaining two comprise an RNN that works directly on character inputs.
Hidden-state (HS). As our first baseline, we adopt the "RNN" method of Pavlopoulos et al. (2017) since it produces state-of-the-art results on the Wikipedia datasets. Given a text formed of a sequence w_1, ..., w_n of words (represented by d-dimensional word embeddings), the method utilizes a 1-layer GRU to encode the words into hidden states h_1, ..., h_n. This is followed by an LR layer that classifies the text based on the last hidden state h_n. We modify the authors' original architecture in two minor ways: we extend the 1-layer GRU to a 2-layer GRU and use softmax as the activation in the LR layer instead of sigmoid. (We also experimented with a 1-layer GRU/LSTM and 1/2-layer bi-directional GRUs/LSTMs, but the performance only worsened or showed no gains; using sigmoid instead of softmax did not have any noteworthy effects on the results either.)

Following Pavlopoulos et al., we initialize the word embeddings to GloVe vectors (Pennington et al., 2014). In all our methods, words not present in the GloVe set are randomly initialized within a small range around zero, indicating the lack of semantic information. By not mapping these words to a single random embedding, we mitigate against the errors that may arise due to their conflation (Madhyastha et al., 2015). A special OOV (out of vocabulary) token is also initialized in the same range. All the embeddings are updated during training, allowing for some of the randomly-initialized ones to get task-tuned (Kim, 2014); the ones that do not get tuned lie closely clustered around the OOV token, to which unseen words in the test set are mapped.
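To make the architecture concrete, below is a minimal sketch of the HS baseline in Keras (the framework used in our experiments). Variable names such as MAX_LEN and VOCAB_SIZE, as well as the random initialization range, are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of the HS baseline: GloVe-initialized embeddings, a
# 2-layer GRU, and a softmax LR layer over the last hidden state h_n.
import numpy as np
from keras.models import Sequential
from keras.layers import Embedding, GRU, Dense

MAX_LEN = 100        # maximum tokens per text (assumption)
VOCAB_SIZE = 50000   # word vocabulary size (assumption)
EMBED_DIM = 300      # 300d for Wikipedia, 200d for Twitter
NUM_CLASSES = 2      # e.g., abusive vs. non-abusive

# GloVe-initialized matrix; rows for words outside GloVe are drawn from a
# small range around zero (+/-0.05 here is an illustrative choice).
embedding_matrix = np.random.uniform(-0.05, 0.05, (VOCAB_SIZE, EMBED_DIM))

model = Sequential([
    Embedding(VOCAB_SIZE, EMBED_DIM, weights=[embedding_matrix],
              input_length=MAX_LEN, trainable=True),  # embeddings get task-tuned
    GRU(128, return_sequences=True),   # first GRU layer
    GRU(128),                          # second GRU layer; emits h_n
    Dense(NUM_CLASSES, activation='softmax'),  # LR layer over h_n
])
model.compile(loss='categorical_crossentropy', optimizer='adam')
```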
Word-sum (WS). The "LSTM + GloVe + GBDT" method of Badjatiya et al. (2017) constitutes our second baseline. The authors first employ an LSTM to task-tune GloVe-initialized word embeddings by propagating error back from an LR layer. They then train a gradient-boosted decision tree (GBDT) classifier to classify texts based on the average of the constituent word embeddings. We make two minor modifications to the original method: we utilize a 2-layer GRU instead of the LSTM to tune the embeddings, and we train the GBDT classifier on the L2-normalized sum of the embeddings instead of their average. (In their work, the authors report that initializing embeddings randomly rather than with GloVe yields state-of-the-art performance on the Twitter dataset that we are using. However, we found the opposite when performing 10-fold stratified cross-validation (CV). A possible explanation of this lies in the authors' decision to not use stratification, which for such a highly imbalanced dataset can lead to unexpected outcomes (Forman and Scholz, 2010). Furthermore, the authors train their LSTM on the entire dataset, including the test part, without any early stopping criterion, which facilitates over-fitting of the randomly-initialized embeddings. The deeper 2-layer GRU slightly improves performance. The L2-normalized sum ensures uniformity of range across the feature set in all our methods; GBDT, being a tree-based model, is not affected by the choice of monotonic function.)

Hidden-state + char n-grams (HS + CNG). Here we extend the hidden-state baseline: we train the 2-layer GRU architecture as before, but now concatenate its last hidden state h_n with L2-normalized character n-gram counts to train a GBDT classifier.
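A minimal sketch of the HS + CNG feature construction follows, using scikit-learn for the character n-gram counts; the toy corpus and the way the GRU states are obtained are illustrative assumptions.

```python
# Sketch of HS + CNG features: the last GRU hidden state h_n concatenated
# with L2-normalized character n-gram counts; illustrative only.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize

texts = ["you are a a55h0le", "have a nice day"]  # toy corpus

# 1-5 character n-grams for Wikipedia (1-4 for Twitter).
vectorizer = CountVectorizer(analyzer='char', ngram_range=(1, 5))
cng = vectorizer.fit_transform(texts).toarray().astype(float)
cng = normalize(cng, norm='l2')  # L2-normalize the counts

# hidden_states: last hidden states h_n from the trained 2-layer GRU,
# e.g., extracted with a truncated Keras model (illustrative placeholder).
hidden_states = np.zeros((len(texts), 128))

features = np.hstack([hidden_states, cng])  # input to the GBDT classifier
```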
Augmented hidden-state + char n-grams (AUGMENTED HS + CNG). In the above methods, unseen words in the test set are simply mapped to the OOV token since we do not have a way of obtaining any semantic information about them. However, this is undesirable since racial slurs and expletives are often deliberately fudged by users to prevent detection. In using a single embedding for all unseen words, we lose the ability to distinguish such obfuscations from other non-obfuscated or rare words. Taking inspiration from the effectiveness of character-level features in abuse detection, we address this issue by having a character-based word composition model that can compose embeddings for unseen words in the test set (Pinter et al., 2017). We then augment the hidden-state + char n-grams method with it.
Specifically, our model (Figure 1b) comprises a 2-layer bi-directional LSTM, followed by a hidden layer with tanh non-linearity and an output layer at the end. The model takes as input a sequence c_1, ..., c_k of characters, represented as one-hot vectors, from a fixed vocabulary (i.e., lowercase English alphabet and digits) and outputs a d-dimensional embedding for the word 'c_1 ... c_k'. The bi-directionality of the LSTM allows for the semantics of both the prefix and the suffix (last forward and backward hidden states) of the input word to be captured, which are then combined to form the hidden state for the input word. The model is trained by minimizing the mean squared error (MSE) between the embeddings that it produces and the task-tuned embeddings of words in the training set. This ensures that newly composed embeddings are endowed with characteristics from both the GloVe space as well as the task-tuning process. While approaches like that of Bojanowski et al. (2017) can also compose embeddings for unseen words, they cannot endow the newly composed embeddings with characteristics from the task-tuning process; this may constitute a significant drawback (Kim, 2014).

During the training of our character-based word composition model, to emphasize frequent words, we feed a word as many times as it appears in the training corpus. We note that a 1-layer CNN with global max-pooling in place of the 2-layer LSTM provides comparable performance while requiring significantly less time to train. This is expected since words are not very long sequences, and the filters of the CNN are able to capture the different character n-grams within them.
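The following is a minimal sketch of the word composition model described above, trained with MSE against the task-tuned embeddings; the sizes and names (MAX_WORD_LEN, CHAR_VOCAB) are illustrative assumptions.

```python
# Sketch of the character-based word composition model: a 2-layer
# bi-directional LSTM over one-hot characters, a tanh hidden layer, and a
# linear output layer producing the d-dimensional word embedding.
from keras.models import Sequential
from keras.layers import Bidirectional, LSTM, Dense

MAX_WORD_LEN = 20   # maximum characters per word (assumption)
CHAR_VOCAB = 36     # lowercase English alphabet + digits
EMBED_DIM = 300     # dimensionality of the target word embeddings

composer = Sequential([
    Bidirectional(LSTM(256, return_sequences=True),
                  input_shape=(MAX_WORD_LEN, CHAR_VOCAB)),
    Bidirectional(LSTM(256)),       # combines prefix (forward) and suffix (backward)
    Dense(256, activation='tanh'),  # hidden layer with tanh non-linearity
    Dense(EMBED_DIM),               # outputs the composed embedding
])
# Targets are the task-tuned embeddings of training-set words; each word is
# fed as many times as it appears in the training corpus.
composer.compile(loss='mse', optimizer='adam')
```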
Context hidden-state + char n-grams (CONTEXT HS + CNG). In the augmented hidden-state + char n-grams method, the word composition model infers the semantics of unseen words solely on the basis of the characters in them. However, for many words, semantic inference and sense disambiguation require context, i.e., knowledge of character sequences in the vicinity. An example is the word cnt, which has different meanings in the sentences "I cnt undrstand this!" and "You feminist cnt!", i.e., cannot in the former and the sexist slur cunt in the latter. Yet another example is an obfuscation like "You mot herf ucker!", where the expletive motherfucker cannot be properly inferred from any fragment without the knowledge of surrounding character sequences.
Figure 1: Context-aware approach to word composition. The figure on the left (a) shows how the encoder extracts context-aware representations of characters in the phrase "cat sat on" from their one-hot representations. The dotted lines denote the space character ⊔, which demarcates word boundaries. The semantics of an unseen word, e.g., sat, can then be inferred by our word composition model, shown on the right (b).

To address this, we develop context-aware representations for characters as inputs to our character-based word composition model instead of one-hot representations. We introduce an encoder architecture to produce the context-aware representations. Specifically, given a text formed of a sequence w_1, ..., w_n of words, the encoder takes as input one-hot representations of the characters c_1, ..., c_k within the concatenated sequence 'w_1 ⊔ ... ⊔ w_n', where ⊔ denotes the space character. This input is passed through a bi-directional LSTM that produces hidden states h_1, ..., h_k, one for every character. Each hidden state, referred to as a context-aware character representation, is the average of its designated forward and backward states; hence, it captures both the preceding as well as the following contexts of the character it corresponds to. Figure 1 illustrates how the context-aware representations are extracted and used for inference by our character-based word composition model. The model is trained in the same manner as in the augmented hidden-state + char n-grams method, i.e., by minimizing the MSE between the embeddings that it produces and the task-tuned embeddings of words in the training set (initialized with GloVe); however, the inputs are now context-aware representations of characters instead of one-hot representations. (We also experimented with word-level context but did not get any significant improvements. We believe this is due to higher variance at the word level than at the character level.)
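A minimal sketch of the context-aware character encoder follows; the text length, vocabulary size, and variable names are illustrative assumptions.

```python
# Sketch of the context-aware character encoder: a bi-directional LSTM over
# the one-hot characters of the whole text, where each character's
# representation is the average of its forward and backward hidden states.
from keras.models import Model
from keras.layers import Input, LSTM, Bidirectional

MAX_TEXT_CHARS = 500  # maximum characters per text (assumption)
CHAR_VOCAB = 37       # lowercase alphabet + digits + the space character

chars = Input(shape=(MAX_TEXT_CHARS, CHAR_VOCAB))
# merge_mode='ave' averages the forward and backward states per character.
context_reps = Bidirectional(LSTM(64, return_sequences=True),
                             merge_mode='ave')(chars)
encoder = Model(inputs=chars, outputs=context_reps)
# The per-character outputs of `encoder` replace the one-hot vectors as
# inputs to the word composition model, which is again trained with MSE.
```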
Word-sum + char n-grams (WS + CNG), Augmented word-sum + char n-grams (AUGMENTED WS + CNG), and Context word-sum + char n-grams (CONTEXT WS + CNG). These methods are identical to the (context/augmented) hidden-state + char n-grams methods except that here we include the character n-grams and our character-based word composition model on top of the word-sum baseline.
Char hidden-state (CHAR HS) and Char word-sum (CHAR WS). In all the methods described up till now, the input to the core RNN is word embeddings. To gauge whether character-level inputs are themselves sufficient, we construct two methods based on the character-to-word (C2W) approach of Ling et al. (2015). For the char hidden-state method, the input is one-hot representations of characters from a fixed vocabulary. These representations are encoded into a sequence w_1, ..., w_n of intermediate word embeddings by a 2-layer bi-directional LSTM. The word embeddings are then fed into a 2-layer GRU that transforms them into hidden states h_1, ..., h_n. Finally, as in the hidden-state baseline, an LR layer with softmax activation uses the last hidden state h_n to perform classification while propagating error backwards to train the network. The char word-sum method is similar except that, once the network has been trained, we use the intermediate word embeddings produced by it to train a GBDT classifier in the same manner as in the word-sum baseline.
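A minimal sketch of the CHAR HS architecture is given below; the nesting of the character-level word encoder inside the sentence-level GRU is the point being illustrated, and all shapes and names are assumptions.

```python
# Sketch of CHAR HS: a character-level word encoder (2-layer bi-directional
# LSTM) applied to every word position, followed by a 2-layer GRU and a
# softmax LR layer over the last hidden state.
from keras.models import Sequential
from keras.layers import Bidirectional, LSTM, GRU, Dense, TimeDistributed

MAX_WORDS = 100     # words per text (assumption)
MAX_WORD_LEN = 20   # characters per word (assumption)
CHAR_VOCAB = 36     # lowercase alphabet + digits
NUM_CLASSES = 2

# Encodes one word's one-hot characters into an intermediate word embedding.
word_encoder = Sequential([
    Bidirectional(LSTM(64, return_sequences=True),
                  input_shape=(MAX_WORD_LEN, CHAR_VOCAB)),
    Bidirectional(LSTM(64)),
])

model = Sequential([
    # Apply the word encoder to each word position independently.
    TimeDistributed(word_encoder,
                    input_shape=(MAX_WORDS, MAX_WORD_LEN, CHAR_VOCAB)),
    GRU(128, return_sequences=True),
    GRU(128),
    Dense(NUM_CLASSES, activation='softmax'),
])
model.compile(loss='categorical_crossentropy', optimizer='adam')
```

For CHAR WS, the same trained network's intermediate word embeddings would instead be summed and passed to a GBDT classifier, as in the word-sum baseline.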
Experimental Setup

We normalize the input by lowercasing all words and removing stop words. For the GRU architecture, we use exactly the same hyper-parameters as Pavlopoulos et al. (2017), i.e., 128 hidden units, Glorot initialization, cross-entropy loss, and the Adam optimizer (Kingma and Ba, 2015). Badjatiya et al. (2017) also use the same settings, except that they have fewer hidden units. The LSTM in our character-based word composition model has 256 hidden units, while that in our encoder has 64; the CNN has filters of widths varying from 1 to 4. The results we report are with an LSTM-based word composition model. In all the models, besides dropout regularization (Srivastava et al., 2014), we hold out a small part of the training set as validation data to prevent over-fitting. We use 300d embeddings and 1 to 5 character n-grams for Wikipedia, and 200d embeddings and 1 to 4 character n-grams for Twitter. We implement the models in Keras (Chollet et al., 2015) with the Theano back-end. We employ LightGBM (Ke et al., 2017) as our GBDT classifier and tune its hyper-parameters using 5-fold grid search.
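A minimal sketch of the GBDT training with LightGBM and a 5-fold grid search follows; the parameter grid and the toy feature matrix are illustrative assumptions, not the grid used in our experiments.

```python
# Sketch of GBDT training with LightGBM and 5-fold grid search.
import numpy as np
from lightgbm import LGBMClassifier
from sklearn.model_selection import GridSearchCV

X = np.random.rand(200, 428)      # e.g., hidden state + char n-gram features
y = np.random.randint(0, 2, 200)  # toy labels

param_grid = {                    # illustrative grid, not the tuned values
    'num_leaves': [31, 63],
    'learning_rate': [0.05, 0.1],
    'n_estimators': [100, 200],
}
search = GridSearchCV(LGBMClassifier(), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
```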
Results

For the Twitter dataset, unlike previous research (Badjatiya et al., 2017; Park and Fung, 2017), we report the macro precision, recall, and F1 averaged over 10 folds of stratified CV (Table 1). (The authors have not released their models; we replicate their methods based on the details in their papers.) For a classification problem with N classes, macro precision (similarly, macro recall and macro F1) is given by:

\[ \text{Macro-}P = \frac{1}{N} \sum_{i=1}^{N} P_i \]

where P_i denotes the precision on class i. Macro metrics provide a better sense of effectiveness on the minority classes (Van Asch, 2013).

We observe that character n-grams (CNG) consistently enhance performance, while our augmented approach (AUGMENTED) further improves upon the results obtained with character n-grams. All the improvements are statistically significant with p < 0.05 under a 10-fold CV paired t-test. As Ling et al. (2015) noted in their POS tagging experiments, we observe that the CHAR HS and CHAR WS methods perform worse than their counterparts that use pre-trained word embeddings, i.e., the HS and WS baselines respectively.

To further analyze the performance of our best methods (CONTEXT/AUGMENTED WS/HS + CNG), we also examine the results on the racism and sexism classes individually (Table 2). As before, we see that our approach consistently improves over the baselines, and the improvements are statistically significant under paired t-tests.
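The macro-averaged metrics reported above correspond to scikit-learn's average='macro' setting; a minimal sketch with toy labels (not our actual predictions):

```python
# Macro precision/recall/F1 as defined above: the unweighted mean of the
# per-class scores, computed here with scikit-learn on toy predictions.
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = ['racism', 'sexism', 'neither', 'neither', 'sexism']
y_pred = ['racism', 'neither', 'neither', 'neither', 'sexism']

print(precision_score(y_true, y_pred, average='macro'))
print(recall_score(y_true, y_pred, average='macro'))
print(f1_score(y_true, y_pred, average='macro'))
```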
Table 1: Results on the Twitter dataset (macro P, R, and F1). Rows: HS; CHAR HS; HS + CNG†; AUGMENTED HS + CNG†; CONTEXT HS + CNG†; WS; CHAR WS; WS + CNG†; AUGMENTED WS + CNG†; CONTEXT WS + CNG†. The methods we propose are denoted by †. Our best method (AUGMENTED WS + CNG) significantly outperforms all other methods.
Table 2: The baselines (WS, HS) vs. our best approaches (†) on the racism and sexism classes (macro P, R, and F1). Rows for each class: HS; AUGMENTED HS + CNG†; CONTEXT HS + CNG†; WS; AUGMENTED WS + CNG†; CONTEXT WS + CNG†.

Additionally, we note that the AUGMENTED WS + CNG method improves the F1 score of the WS + CNG method from 74.12 to 75.01 for the racism class, and from 74.03 to 74.44 for the sexism class. The AUGMENTED HS + CNG method similarly improves the F1 score of the HS + CNG method from 74.00 to 74.40 on the racism class, while making no notable difference on the sexism class.

We see that the CONTEXT HS/WS + CNG methods do not perform as well as the AUGMENTED HS/WS + CNG methods. One reason for this is that the Twitter dataset is not able to expose the methods to enough contexts due to its small size. Moreover, because the collection of this dataset was bootstrapped with a search for certain commonly-used abusive words, many such words are shared across multiple tweets belonging to different classes. Given the above, context-aware character representations perhaps do not provide substantial distinctive information.
Following previous work (Pavlopoulos et al., 2017; Wulczyn et al., 2017), we conduct a standard 60:40 train–test split experiment on the two Wikipedia datasets: 60% of the comments in each dataset are used for training and the remaining 40% for testing. Table 3 reports the macro F1 scores. We do not report scores from the CHAR HS and CHAR WS methods since they showed poor preliminary results compared to the HS and WS baselines.

Table 3: Macro F1 scores on the two Wikipedia datasets (W-TOX and W-ATT). Rows: HS; HS + CNG†; AUGMENTED HS + CNG†; CONTEXT HS + CNG†; WS; WS + CNG†; AUGMENTED WS + CNG†; CONTEXT WS + CNG†. The current state-of-the-art method for these datasets is HS; † denotes the methods we propose. Our best method (CONTEXT HS + CNG) outperforms all the other methods.

Mirroring the analysis carried out for the Twitter dataset, Table 4 further compares the performance of our best methods for Wikipedia (CONTEXT/AUGMENTED HS + CNG) with that of the state-of-the-art baseline (HS), specifically on the abusive classes of W-TOX and W-ATT.

Table 4: The current state-of-the-art baseline (HS) vs. our best methods (†) on the abusive classes of (a) W-TOX and (b) W-ATT (macro P, R, and F1). Rows for each dataset: HS; AUGMENTED HS + CNG†; CONTEXT HS + CNG†.

We observe that the augmented approach substantially improves over the state-of-the-art baseline. Unlike in the case of Twitter, our context-aware setup for word composition is now able to further enhance performance, courtesy of the larger size of the datasets, which increases the availability of contexts. All improvements are significant (p < 0.05) under paired t-tests. We note, however, that the gains we get here with the word composition model are relatively small compared to those we get for Twitter. This difference can be explained by the fact that: (i) Wikipedia comments are less noisy than the tweets and contain fewer obfuscations, and (ii) the Wikipedia datasets, being much larger, expose the methods to more words during training, hence reducing the likelihood of unseen words being important to the semantics of the comments they belong to (Kim et al., 2016).

Like Pavlopoulos et al. (2017), we see that the methods that involve summation of word embeddings (WS) perform significantly worse on the Wikipedia datasets compared to those that use the hidden state (HS); however, their performance is comparable or even superior on the Twitter dataset. This contrast is best explained by the observation of Nobata et al. (2016) that taking the average or sum of word embeddings compromises contextual and word order information. While this is beneficial in the case of tweets, which are short and loosely structured, it leads to poor performance of the WS and WS + CNG methods on the Wikipedia datasets, with the addition of the word composition model (CONTEXT/AUGMENTED WS + CNG) providing little to no improvements.
Table 5: Improved classification upon the addition of character n-grams (CNG) and our word composition model (AUGMENTED). Names of users have been replaced with @mention for anonymity.

Abusive sample | WS | WS + CNG | AUGMENTED WS + CNG
@mention I love how the Islamofascists recruit 14 and 15 year old jihadis and then talk about minors in reference to 17 year olds. | neither | racism | racism
@mention @mention @mention As a certified inmate of the Islamasylum, you don't have the ability to judge. | neither | racism | racism
@mention "I'll be ready in 5 minutes" from a girl usually means "I'll be ready in 20+ minutes." | | |
Analysis

To investigate the extent to which obfuscated words can be a problem, we extract a number of statistics. Specifically, we notice that a considerable number of the unique tokens present in the Twitter dataset cannot be found in the English dictionary (we use the US English spell-checking utility provided by the PyEnchant library of Python). Some of these tokens are present in the racist tweets, others in the sexist tweets, and the rest in tweets that are neither. Examples from the racist tweets include fuckbag, ezidiz, islamofascists, islamistheproblem, islamasylum and isisaremuslims, while those from the sexist tweets include c*nt, bbbbitch, feminismisawful, and stupidbitch. Given that the racist and sexist tweets come from a small number of unique users, 5 and 527 respectively, we believe that the presence of obfuscated words would be even more pronounced if tweets were procured from more unique users.

In the case of the Wikipedia datasets, a large number of unique tokens in the abusive comments of both W-TOX and W-ATT are not attested in the English dictionary. Examples of such tokens from W-TOX include fuggin, n*gga, and fuycker; and from W-ATT, f**king, beeeitch, musulmans, and motherfucken. In comparison to the tweets, the Wikipedia comments use more "standard" language. This is validated by the fact that only 14% of the tokens present in W-TOX and W-ATT are absent from the English dictionary, as opposed to 32% of the tokens in the Twitter dataset, even though the Wikipedia datasets are almost ten times larger.

Across the three datasets, we note that the addition of character n-gram features enhances the performance of RNN-based methods, corroborating the previous findings that they capture complementary structural and lexical information of words. The inclusion of our character-based word composition model yields state-of-the-art results on all the datasets, demonstrating the benefits of inferring the semantics of unseen words. Table 5 shows some abusive samples from Twitter that are misclassified by the WS baseline method but are correctly classified upon the addition of character n-grams (WS + CNG) and the further addition of our character-based word composition model (AUGMENTED WS + CNG).

Many of the abusive tweets that remain misclassified by the AUGMENTED WS + CNG method are those that are part of some abusive discourse (e.g., @Mich McConnell Just "her body" right?) or contain URLs to abusive content (e.g., @salmonfarmer1: Logic in the world of Islam http://t.co/6nALv2HPc3). In the case of the Wikipedia datasets, there are abusive examples like smyou have a message re your last change, go fuckyourself!!! and F-uc-k you, a-ss-hole Motherf–ucker! that are misclassified by the state-of-the-art HS baseline and the HS + CNG method but are correctly classified by our best method for the datasets, i.e., CONTEXT HS + CNG.
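The dictionary-membership statistics above rely on a simple per-token lookup; a minimal sketch of such a check with PyEnchant (the token list here is a toy assumption):

```python
# Counting tokens absent from the US English dictionary with PyEnchant,
# as in the statistics above; the token list is a toy example.
import enchant

dictionary = enchant.Dict("en_US")
tokens = ["women", "w0m3n", "jihad", "jihaaadi", "stupidbitch"]

out_of_dictionary = [t for t in tokens if not dictionary.check(t)]
print(len(out_of_dictionary) / len(tokens))  # fraction of out-of-dictionary tokens
```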
Table 6: Words in the training set that exhibit high cosine similarity to the given word. The ones marked with † are not seen during training; embeddings for them are composed using our word composition model.

Word | Similar words in training set
women | girls, woman, females, chicks, ladies
w0m3n † | woman, women, girls, ladies, chicks
cunt | twat, prick, faggot, slut, asshole
a5sh0les † | assholes, stupid, cunts, twats, faggots
stupidbitch † | idiotic, stupid, dumb, ugly, women
jihad | islam, muslims, sharia, terrorist, jihadi
jihaaadi † | terrorists, islamist, jihadists, muslims
terroristislam † | terrorists, muslims, attacks, extremists
fuckyouass † | fuck, shit, fucking, damn, hell

To ascertain the effectiveness of our task-tuning process for embeddings, we conducted a qualitative analysis, validating that semantically similar words cluster together in the embedding space. Analogously, we assessed the merits of our word composition model by verifying the neighbors of embeddings it forms for obfuscated words not seen during training. Table 6 provides some examples. We see that our model correctly infers the semantics of obfuscated words, even in cases where the obfuscation is by concatenation of words.

Conclusion

In this paper, we considered the problem of obfuscated words in the field of automated abuse detection. Working with three datasets from two different domains, namely Twitter and Wikipedia talk pages, we first comprehensively replicated the previous state-of-the-art RNN methods for the datasets. We then showed that character n-grams capture complementary information and are hence able to enhance the performance of the RNNs. Finally, we constructed a character-based word composition model in order to infer semantics for unseen words, and further extended it with context-aware character representations. The integration of our composition model with the enhanced RNN methods yielded the best results on all three datasets. We have experimentally demonstrated that our approach to modeling obfuscated words significantly advances the state of the art in abuse detection. In the future, we wish to explore its efficacy in tasks such as grammatical error detection and correction. We will make our models and logs of experiments publicly available at https://github.com/pushkarmishra/AbuseDetection.

Acknowledgements
Special thanks to the anonymous reviewers for their valuable comments and suggestions.
References
Pinkesh Badjatiya, Shashank Gupta, Manish Gupta, and Vasudeva Varma. 2017. Deep learning for hate speech detection in tweets. In Proceedings of the 26th International Conference on World Wide Web Companion, WWW '17 Companion, pages 759–760, Republic and Canton of Geneva, Switzerland. International World Wide Web Conferences Steering Committee.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.

François Chollet et al. 2015. Keras.

Thomas Davidson, Dana Warmsley, Michael Macy, and Ingmar Weber. 2017. Automated hate speech detection and the problem of offensive language. In Proceedings of the 11th International AAAI Conference on Web and Social Media, ICWSM '17.

Nemanja Djuric, Jing Zhou, Robin Morris, Mihajlo Grbovic, Vladan Radosavljevic, and Narayan Bhamidipati. 2015. Hate speech detection with comment embeddings. In Proceedings of the 24th International Conference on World Wide Web, WWW '15 Companion, pages 29–30, New York, NY, USA. ACM.

Maeve Duggan. 2014. Online harassment.

George Forman and Martin Scholz. 2010. Apples-to-apples in cross-validation studies: Pitfalls in classifier performance measurement. SIGKDD Explor. Newsl., 12(1):49–57.

Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. 2017. LightGBM: A highly efficient gradient boosting decision tree. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 3149–3157. Curran Associates, Inc.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746–1751. Association for Computational Linguistics.

Yoon Kim, Yacine Jernite, David Sontag, and Alexander M. Rush. 2016. Character-aware neural language models. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI'16, pages 2741–2749. AAAI Press.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations, ICLR '15.

Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on International Conference on Machine Learning, ICML '14.

Wang Ling, Chris Dyer, Alan W Black, Isabel Trancoso, Ramon Fermandez, Silvio Amir, Luis Marujo, and Tiago Luis. 2015. Finding function in form: Compositional character models for open vocabulary word representation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1520–1530. Association for Computational Linguistics.

Pranava Swaroop Madhyastha, Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2015. Mapping unseen words to task-trained embedding spaces. CoRR, abs/1510.02387.

Pushkar Mishra, Marco Del Tredici, Helen Yannakoudakis, and Ekaterina Shutova. 2018. Author profiling for abuse detection. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1088–1098. Association for Computational Linguistics.

Chikashi Nobata, Joel Tetreault, Achint Thomas, Yashar Mehdad, and Yi Chang. 2016. Abusive language detection in online user content. In Proceedings of the 25th International Conference on World Wide Web, WWW '16, pages 145–153, Republic and Canton of Geneva, Switzerland. International World Wide Web Conferences Steering Committee.

Ji Ho Park and Pascale Fung. 2017. One-step and two-step classification for abusive language detection on Twitter. In Proceedings of the First Workshop on Abusive Language Online, pages 41–45. Association for Computational Linguistics.

John Pavlopoulos, Prodromos Malakasiotis, and Ion Androutsopoulos. 2017. Deep learning for user comment moderation. In Proceedings of the First Workshop on Abusive Language Online, pages 25–35. Association for Computational Linguistics.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Yuval Pinter, Robert Guthrie, and Jacob Eisenstein. 2017. Mimicking word embeddings using subword RNNs. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 102–112. Association for Computational Linguistics.

J. Qian, M. ElSherief, E. Belding, and W. Wang. 2018. Leveraging intra-user and inter-user representation learning for automated hate speech detection. In NAACL HLT, New Orleans, LA, June 2018, page to appear.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958.

Vincent Van Asch. 2013. Macro- and micro-averaged evaluation measures [[basic draft]]. Computational Linguistics & Psycholinguistics, University of Antwerp, Belgium.

Zeerak Waseem. 2016. Are you a racist or am I seeing things? Annotator influence on hate speech detection on Twitter. In Proceedings of the First Workshop on NLP and Computational Social Science, pages 138–142. Association for Computational Linguistics.

Zeerak Waseem, Wendy Hui Kyong Chung, Dirk Hovy, and Joel Tetreault. 2017. Proceedings of the First Workshop on Abusive Language Online. Association for Computational Linguistics.

Zeerak Waseem and Dirk Hovy. 2016. Hateful symbols or hateful people? Predictive features for hate speech detection on Twitter. In Proceedings of the NAACL Student Research Workshop, pages 88–93, San Diego, California. Association for Computational Linguistics.

Ellery Wulczyn, Nithum Thain, and Lucas Dixon. 2017. Ex machina: Personal attacks seen at scale. In
Proceedings of the 26th International Conference on World Wide Web, WWW '17, pages 1391–1399, Republic and Canton of Geneva, Switzerland. International World Wide Web Conferences Steering Committee.

Dawei Yin, Brian D. Davison, Zhenzhen Xue, Liangjie Hong, April Kontostathis, and Lynne Edwards. 2009. Detection of harassment on Web 2.0. In Proceedings of the Content Analysis in the Web 2.0 (CAW2.0) Workshop.