QutNocturnal@HASOC'19: CNN for Hate Speech and Offensive Content Identification in Hindi Language
Md Abul Bashar and Richi Nayak
School of Electrical Engineering and Computer Science, Queensland University of Technology, Brisbane, Australia
{m1.bashar, r.nayak}@qut.edu.au

Abstract.
We describe our top-team solution to Task 1 for Hindi in the HASOC contest organised by FIRE 2019. The task is to identify hate speech and offensive language in Hindi. More specifically, it is a binary classification problem where a system is required to classify tweets into two classes: (a) Hate and Offensive (HOF) and (b) Not Hate or Offensive (NOT). In contrast to the popular idea of pretraining word vectors (a.k.a. word embedding) with a large corpus from a general domain such as Wikipedia, we used a relatively small collection of relevant tweets (i.e. random and sarcasm tweets in Hindi and Hinglish) for pretraining. We trained a Convolutional Neural Network (CNN) on top of the pretrained word vectors. This approach allowed us to be ranked first for this task out of all the teams. Our approach could easily be adapted to other applications where the goal is to predict the class of a text when the provided context is limited.
Keywords:
Hate Speech · Offensive Content · Hindi · CNN · Deep Learning.
1 Introduction

The “Hate Speech and Offensive Content Identification in Indo-European Languages” track (HASOC) is one of the tracks in the FIRE 2019 conference [14]. Task 1 in this track is the identification of hate speech and offensive (HOF) language in English, German and Hindi in social media posts. In this paper, we describe our approach to the solution of Task 1 in Hindi. The goal is to label a tweet written in Hindi as HOF if it contains any form of non-acceptable language such as hate speech, aggression or profanity; otherwise it is labelled as NOT. There has been significant research on hate speech and offensive content identification in several languages, especially in English [3,2,6,25,24]. However, there is a lack of work in most other languages, and people are now realising the urgency of such research in other languages. Recently, SemEval 2019 Task 5 [4] was carried out

https://hasoc2019.github.io/call for participation.html
http://fire.irsi.res.in/fire/2019/home
on detecting hate speech against immigrants and women in Spanish and English messages extracted from Twitter, the GermEval Shared Task [22] was carried out on the identification of offensive language in German tweets, and TRAC-1 [11] conducted a shared task on aggression identification in Hindi and English. Therefore, HASOC Task 1 for Hindi intends to find out the quality of hate speech and offensive content identification technology in Hindi.

The training dataset comprises 4665 labelled tweets in Hindi. The training dataset is created from Twitter, and participants are allowed to use external datasets for this task. In the competition setup, the testing dataset comprises 1319 unlabelled tweets that were also created from Twitter. The testing dataset and leaderboard were kept unknown to participants until the results were announced. Competitors had to split the training set to get a validation set and use the validation set throughout the competition to compare models. The testing set was only used at the end of the competition for the final leaderboard.

The proposed approach relies on very little feature engineering and preprocessing compared to many existing approaches. Section 2 discusses our top-ranked model building approach. It consists of two steps: (a) pretraining word vectors using a relevant collection of unlabelled tweets and (b) training a Convolutional Neural Network (CNN) model on top of the pretrained word vectors using the labelled training set. Section 3 describes other sophisticated alternative models that we tried. Though these models did not perform as well as our winning model in this track, their performance provides further insight into how to use machine learning models for identifying hate speech and offensive language in Hindi. Section 4 provides experimental results comparing and analysing our various models on both the testing set and the validation set. The source code of our model can be found online at [1].
2 Model Building Approach

The goal of Task 1 for Hindi is to predict the class (HOF or NOT) of a given tweet written in Hindi. Out of 4665 labelled tweets in the training set, 2469 (52.92%) are HOF and 2196 (47.07%) are NOT. We randomly kept 20% of the training data as a validation set. We used ten-fold cross validation on the remaining training set for hyperparameter setting.
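The split described above (a fixed 20% validation set plus ten-fold cross validation on the remainder) can be sketched as follows. This is a minimal illustration with toy data, not the authors' released code; the variable names are our own.

```python
from sklearn.model_selection import train_test_split, StratifiedKFold

# Toy stand-in for the 4665 labelled tweets (texts and HOF/NOT labels).
tweets = ["tweet %d" % i for i in range(100)]
labels = [i % 2 for i in range(100)]  # 1 = HOF, 0 = NOT

# Hold out 20% of the training data as a fixed validation set.
X_train, X_val, y_train, y_val = train_test_split(
    tweets, labels, test_size=0.2, stratify=labels, random_state=42)

# Ten-fold cross validation on the remaining 80% for hyperparameter tuning.
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
n_folds = sum(1 for _ in kfold.split(X_train, y_train))
print(len(X_train), len(X_val), n_folds)  # → 80 20 10
```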
Unlabelled External Dataset
It is a difficult task to separate abusive tweets from tweets that are sarcastic, joking, or contain abusive keywords in a non-abusive context [3]. Lexical detection methods tend to have low accuracy [6,23] because they classify a tweet as abusive if it contains any abusive keywords. Also, tweets are significantly noisy and do not follow a standard language format. For example, words in tweets are often misspelled, altered, written in Roman letters, or include local dialects or foreign languages. To transfer the knowledge of these contexts to the CNN-based deep learning model, we pretrain word vectors using 0.5 million relevant tweets. More specifically, we collected 494,311 random tweets in Hindi (i.e. the topic of discussion can be anything) using TrISMA and 5251 sarcasm tweets in Hinglish [15] (i.e. sarcasm in the Hindi language but written in Roman letters) from [19] for pretraining.

Preprocessing
We de-identified person mentions (e.g. @someone) with xxatp, URL occurrences with xxurl, the source of a modified retweet with xxrtm and the source of an unmodified retweet with xxrtu. We fixed repeating characters (e.g. goooood) in words and removed common invalid characters (e.g. <br/>, <unk>, @-@, etc.). We used html unescape to replace hexadecimal escape sequences with the characters they represent. We used the multi-language spaCy module to lemmatise words and a lightweight stemmer for the Hindi language [18] for stemming the words.

Word Embedding Embedding models quantify semantic similarities between words based on their distributional property that a word is characterised by the company it keeps. These models quantify semantic properties of words by mapping co-occurring words close to each other in a Euclidean space. Given a sizeable corpus, these models can effectively learn a high-quality word embedding from the co-occurrence of words in the corpus. Word embedding maps each word from the vocabulary to a vector of real numbers. Mikolov et al. [16] proposed two popular models for word embedding based on the feed-forward neural network: Skip-gram and Continuous Bag-of-Words, as shown in Figure 1.

In embedding models, a sliding window of a fixed size moves along the text of a corpus. For a given position of the sliding window, let the word in the middle be the current word w_i and the words on its left and right within the sliding window be the context words C. The continuous bag-of-words model predicts the current word w_i from the surrounding context words C, i.e. p(w_i|C). In contrast, the skip-gram model uses the current word w_i to predict the surrounding context words C, i.e. p(C|w_i). For example, suppose the current position of a running sliding window contains the phrase tum sirf chutiya kat ti ho.
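The sliding-window example can be made concrete with a short sketch that extracts the (context, centre) pairs a CBOW model trains on. This is a toy illustration of the windowing, not the authors' code.

```python
def cbow_pairs(tokens, window):
    """Yield (context words, centre word) pairs as a CBOW model sees them."""
    pairs = []
    for i, centre in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = tokens[lo:i] + tokens[i + 1:hi]
        pairs.append((context, centre))
    return pairs

tokens = ["tum", "sirf", "chutiya", "kat", "ti", "ho"]
# With "chutiya" as the centre word, CBOW predicts it from the context;
# skip-gram does the reverse and predicts the context from "chutiya".
print(cbow_pairs(tokens, window=5)[2])
# → (['tum', 'sirf', 'kat', 'ti', 'ho'], 'chutiya')
```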
In continuous bag-of-words, the context words {tum, sirf, kat, ti, ho} can be used to predict the current word {chutiya}, whereas, in skip-gram, the current word {chutiya} can be used to predict the context words {tum, sirf, kat, ti, ho}. The objective of model training is to find a word embedding that maximises p(w_i|C) or p(C|w_i) over a corpus. In each step of training, each word is either (a) pulled closer to the words that co-occur with it or (b) pushed away from all the words that do not co-occur with it. A softmax or approximate softmax function can be used to achieve this objective [16]. At the end of the training, the embedding brings closer not only the words that explicitly co-occur in a training dataset, but also the words that implicitly co-occur. For example, if w_1 explicitly co-occurs with w_2 and w_2 explicitly co-occurs with w_3, then the model can bring closer not only w_1 to w_2, but also w_1 to w_3.

https://research.qut.edu.au/dmrc/projects/trisma-tracking-infrastructure-for-social-media-analysis/
https://spacy.io/models/xx

Fig. 1: Continuous Bag-of-Words and Skip-gram Word Embedding Models [3]

We use the continuous bag-of-words model in this contest as this model is faster and has slightly better accuracy for frequently appearing words, based on our experimental results. We implemented this model using the Word2Vec module in the Gensim Python library. We set the word vector dimension to 200, the minimum word count to 2, the number of pretraining iterations to 10, the sliding window size to 5, and placed no cap on the vocabulary size. We run this model on the unlabelled external dataset described in Section 2.1 to get the pretrained word vectors. Our pretrained word vectors and the corresponding Python code to use them in the classifier are available online at [1].
The proposed architecture of our top-ranked CNN model to identify hate speech and offensive language in Hindi is given in Figure 2. It is an empirically customised and regularised version of the architecture that we used in our prior work on misogynistic tweet identification on Twitter [3]. In this architecture, we use word embedding to represent each word w by an n-dimensional word vector w ∈ R^n. We represent a tweet t with m words as a matrix t ∈ R^(m×n). We apply the convolution operation to the tweet matrix with a stride of one. Each convolution operation applies a filter f_i ∈ R^(h×n) of size h. Empirically, based on the accuracy improvement in ten-fold cross validation, 256 filters are used for h ∈ {3, 4} and 512 filters for h ∈ {5}. The convolution is a function c(f_i, t) = r(f_i · t_(k:k+h−1)), where t_(k:k+h−1) is the k-th vertical slice of the tweet matrix from position k to k+h−1, f_i is the given filter and r is a Rectified Linear Unit (ReLU) function [17]. The function c(f_i, t) produces a feature c_k, similar to an n-gram, for each slice k, resulting in m−h+1 features. The max-pooling operation [20] is applied over these features and the maximum value is taken, i.e. ĉ_i = max(c(f_i, t)). Max-pooling captures the most important feature for each filter. As there are a total of 1024 filters (256+256+512) in the proposed model, the 1024 most important features are learned from the convolution layer.

Then, we pass these features to a fully connected hidden layer with 256 perceptrons that use the ReLU activation function. This fully connected hidden layer learns the complex non-linear interactions between the features from the convolution layer and generates 256 higher-level new features. Finally, we pass these 256 higher-level features to the output layer with a single perceptron that uses the sigmoid activation function.
The perceptron in the output layer generates the probability of the tweet being HOF or NOT.

In this architecture (Figure 2), a proportion of units are randomly dropped out from each layer except the output. This is done to prevent co-adaptation of units in a layer and to reduce overfitting. Based on the best empirical results, we drop out 50% of the units from the input layer, the filters of size 3 and the fully connected hidden layer. Only 20% of the units are dropped out from the filters of size 4 and 5. Python code for this model is available online at [1].
Fig. 2: Architecture of our top-ranked CNN Model for the Hate Speech and OffensiveContent Identification track in Hindi Language [3]
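The architecture described above can be sketched in tf.keras as below. This is our illustrative reconstruction, not the released implementation [1]: the vocabulary size, tweet length and random embedding initialisation are placeholders (in the actual model the embedding is initialised with the pretrained word vectors), and dropout placement on the convolution branches is one reasonable reading of the description.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, Model

MAX_LEN, EMB_DIM, VOCAB = 50, 200, 10000  # illustrative sizes

inp = layers.Input(shape=(MAX_LEN,))
# In the paper, this embedding is initialised with the pretrained vectors.
emb = layers.Embedding(VOCAB, EMB_DIM)(inp)
emb = layers.Dropout(0.5)(emb)  # 50% dropout on the input layer

# Three branches: 256 filters of sizes 3 and 4, 512 filters of size 5;
# 50% dropout on size-3 filters, 20% on sizes 4 and 5.
branches = []
for h, n_filters, drop in [(3, 256, 0.5), (4, 256, 0.2), (5, 512, 0.2)]:
    c = layers.Conv1D(n_filters, h, activation="relu")(emb)
    c = layers.Dropout(drop)(c)
    branches.append(layers.GlobalMaxPooling1D()(c))  # one max per filter

x = layers.Concatenate()(branches)           # 256+256+512 = 1024 features
x = layers.Dense(256, activation="relu")(x)  # fully connected hidden layer
x = layers.Dropout(0.5)(x)
out = layers.Dense(1, activation="sigmoid")(x)  # probability of HOF

model = Model(inp, out)
probs = model.predict(np.zeros((2, MAX_LEN)), verbose=0)
print(probs.shape)  # → (2, 1)
```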
3 Alternative Models

We implemented eight other models in addition to the winning CNN model to see the performance of hate speech and offensive language detection in Hindi.

– Long Short-Term Memory network (LSTM) [9]. We implement the LSTM with 100 units, 50% dropout, the binary cross-entropy loss function, the Adam optimiser and sigmoid activation.
– Feedforward Deep Neural Network (DNN) [7]. We implement the DNN with five hidden layers, each layer containing eight units, 50% dropout applied to the input layer and the first two hidden layers, softmax activation and a 0.04 learning rate. We manually tuned the hyperparameters of all neural network based models (CNN, LSTM, DNN) based on cross-validation.
– Non-NN models, including Support Vector Machine (SVM) [8], Random Forest (RF) [13], XGBoost (XGB) [5], Multinomial Naive Bayes (MNB) [12], k-Nearest Neighbours (kNN) [21] and Ridge Classifier (RC) [10]. We automatically tune the hyperparameters of all these models using ten-fold cross-validation and GridSearch from scikit-learn. Among all the models, only CNN and LSTM use transfer learning.
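The automatic tuning of the non-NN models can be sketched with scikit-learn's grid search, shown here for the Ridge Classifier with an illustrative parameter grid and toy features (the actual grids and features are in the released code [1]).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import RidgeClassifier

# Toy stand-in for the vectorised features of the labelled tweets.
X, y = make_classification(n_samples=200, n_features=20, random_state=42)

# Illustrative grid; one grid per model would be defined the same way.
search = GridSearchCV(
    RidgeClassifier(),
    param_grid={"alpha": [0.1, 1.0, 10.0]},
    cv=10,                # ten-fold cross-validation
    scoring="f1_macro",
)
search.fit(X, y)
print(search.best_params_)
```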
4 Experimental Results

A total of nine machine learning models, including the winning customised CNN model, were trained to identify hate speech and offensive language in Hindi. We used transfer learning of word vectors for both CNN and LSTM. The word vectors were pretrained on a collection of relevant tweets and fine-tuned with the training dataset during model training.
The experimental results comparing the models on the custom validation set are given in Table 1. The detailed results of the winning CNN model on the test dataset are
Table 1: Model Comparison Results in Custom Validation Set
Macro Average of Classes
          CNN  DNN  kNN  LSTM  MNB  RF  RC  SVM  XGB
precision

given in Table 2. In the absence of any information other than the email message about the top-team performance, we are not able to provide comparative results with the other submitted teams. We will update this table with the rest of the teams' performance once we receive the information from the track organisers.

Table 2: Detailed Results of the Winning CNN Model on the Test Dataset
Confusion Matrix
             Predicted HOF  Predicted NOT
Actual HOF   446            159
Actual NOT   80             633

Class-Wise Performance
              Precision  Recall  F1-score  Support
HOF           0.85       0.74    0.79      605
NOT           0.80       0.89    0.84      713
Accuracy                         0.82      1318
Macro avg     0.82       0.81    0.81      1318
Weighted avg  0.82       0.82    0.82      1318

Experimental results on both the validation and test sets show that CNN outperforms all other models. CNN is able to outperform LSTM and the other baseline models because of the specific nature of tweets. For example, tweets can be super-condensed and indirect texts (e.g. satire), may not follow the standard sequence of the language, and can be full of noise.

Traditional models (e.g. SVM, XGBoost, RF, kNN, etc.) are based on the bag-of-words assumption. The bag-of-words (or bag-of-phrases) representation cannot capture sequences and patterns that are very important for identifying hate speech and offensive content in tweets. For example, if a tweet ends by saying if you know what I mean, there is a high chance that it is an offensive tweet, even though the individual words are innocent.

An LSTM model is popularly used in natural language processing research because of its effectiveness in handling sequences in text datasets. Empirical results in Table 1 show that it performed as the second-best model. However, the sequence in a tweet can be highly impacted by noise [3,23]; consequently, the LSTM finds it difficult to identify the class. On the other hand, CNN can identify many small and large patterns in a tweet; if some of them are impacted by noise, it can still use other patterns to identify the class.
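The class-wise numbers in Table 2 follow directly from the confusion matrix; a quick arithmetic check for the HOF class and overall accuracy:

```python
# Confusion matrix from Table 2 (rows = actual, columns = predicted).
tp, fn = 446, 159   # actual HOF predicted as HOF / NOT
fp, tn = 80, 633    # actual NOT predicted as HOF / NOT

precision_hof = tp / (tp + fp)             # 446/526
recall_hof = tp / (tp + fn)                # 446/605
f1_hof = 2 * precision_hof * recall_hof / (precision_hof + recall_hof)
accuracy = (tp + tn) / (tp + fn + fp + tn)

print(round(precision_hof, 2), round(recall_hof, 2),
      round(f1_hof, 2), round(accuracy, 2))
# → 0.85 0.74 0.79 0.82
```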
5 Conclusion

We introduce an effective method for the task of hate speech and offensive content identification in Hindi. We propose a custom CNN architecture built on word vectors pretrained on a relevant corpus from the task-specific domain. The proposed model was the top-ranked model in this task under the track. We conducted a series of experiments using state-of-the-art models. Experimental results show that the contexts of hate speech and offensive content can be captured through transfer learning of word embeddings (a.k.a. word vectors), and those contexts can significantly improve the performance of hate speech and offensive content identification. We also observed that when transfer learning through word vectors is utilised, CNN performs better than LSTM because of the noisy nature of tweets. CNN can identify many small and large patterns in a tweet; if some of them get altered by noise, it can still use other
patterns to identify the class of the tweet. On the other hand, LSTM uses the sequence of a tweet to identify its class, but noise in the tweet can alter the sequence and make it hard for the LSTM to identify the class.
References
1. Python code and pretrained word vectors of QutNocturnal-HASOC2019. https://github.com/mdabashar/QutNocturnal-Hasoc2019, accessed: 04-10-2019
2. Badjatiya, P., Gupta, S., Gupta, M., Varma, V.: Deep learning for hate speech detection in tweets. In: Proceedings of the 26th International Conference on World Wide Web Companion. pp. 759–760. International World Wide Web Conferences Steering Committee (2017)
3. Bashar, M.A., Nayak, R., Suzor, N., Weir, B.: Misogynistic tweet detection: Modelling CNN with small datasets. In: Australasian Conference on Data Mining. pp. 3–16. Springer (2018)
4. Basile, V., Bosco, C., Fersini, E., Nozza, D., Patti, V., Pardo, F.M.R., Rosso, P., Sanguinetti, M.: SemEval-2019 Task 5: Multilingual detection of hate speech against immigrants and women in Twitter. In: Proceedings of the 13th International Workshop on Semantic Evaluation. pp. 54–63 (2019)
5. Chen, T., Guestrin, C.: XGBoost: A scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. pp. 785–794. ACM (2016)
6. Davidson, T., Warmsley, D., Macy, M., Weber, I.: Automated hate speech detection and the problem of offensive language. arXiv preprint arXiv:1703.04009 (2017)
7. Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. pp. 249–256 (2010)
8. Hearst, M.A., Dumais, S.T., Osuna, E., Platt, J., Scholkopf, B.: Support vector machines. IEEE Intelligent Systems and their Applications 13(4), 18–28 (1998)
9. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997)
10. Hoerl, A.E., Kennard, R.W.: Ridge regression: applications to nonorthogonal problems. Technometrics 12(1), 69–82 (1970)
11. Kumar, R., Ojha, A.K., Malmasi, S., Zampieri, M.: Benchmarking aggression identification in social media. In: Proceedings of TRAC (2018)
12. Lewis, D.D.: Naive (Bayes) at forty: The independence assumption in information retrieval. In: European Conference on Machine Learning. pp. 4–15. Springer (1998)
13. Liaw, A., Wiener, M., et al.: Classification and regression by randomForest. R News 2(3), 18–22 (2002)
14. Mandl, T., Modha, S., Patel, D., Dave, M., Mandlia, C., Patel, A.: Overview of the HASOC track at FIRE 2019: Hate Speech and Offensive Content Identification in Indo-European Languages. In: Proceedings of the 11th Annual Meeting of the Forum for Information Retrieval Evaluation (December 2019)
15. Mathur, P., Shah, R., Sawhney, R., Mahata, D.: Detecting offensive tweets in Hindi-English code-switched language. In: Proceedings of the Sixth International Workshop on Natural Language Processing for Social Media. pp. 18–26 (2018)
16. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems. pp. 3111–3119 (2013)
17. Nair, V., Hinton, G.E.: Rectified linear units improve restricted Boltzmann machines. In: Proceedings of the 27th International Conference on Machine Learning (ICML-10). pp. 807–814 (2010)
18. Ramanathan, A., Rao, D.: A lightweight stemmer for Hindi. In: Workshop on Computational Linguistics for South-Asian Languages, EACL (2003)
19. Swami, S., Khandelwal, A., Singh, V., Akhtar, S.S., Shrivastava, M.: A corpus of English-Hindi code-mixed tweets for sarcasm detection. arXiv preprint arXiv:1805.11869 (2018)
20. Tolias, G., Sicre, R., Jégou, H.: Particular object retrieval with integral max-pooling of CNN activations. arXiv preprint arXiv:1511.05879 (2015)
21. Weinberger, K.Q., Saul, L.K.: Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research 10, 207–244 (2009)