Comparison of Classical Machine Learning Approaches on Bangla Textual Emotion Analysis
Md. Ataur Rahman
Language Science and Technology, University of Saarland, Saarbrücken, Germany
[email protected]
Md. Hanif Seddiqui
Computer Science and Engineering, University of Chittagong, Chittagong, Bangladesh
[email protected]
Abstract
Detecting emotions from text is an extension of simple sentiment polarity detection. Instead of considering only positive or negative sentiments, emotions are conveyed in a more tangible manner; thus, they can be expressed as many shades of gray. This paper presents the results of our experiments on fine-grained emotion analysis of Bangla text. We gathered and annotated a text corpus consisting of user comments from several Facebook groups regarding socio-economic and political issues, and we made efforts to extract the basic emotions (sadness, happiness, disgust, surprise, fear, anger) conveyed through these comments. Finally, we compared the results of the five most popular classical machine learning techniques, namely Naïve Bayes, Decision Tree, k-Nearest Neighbor (k-NN), Support Vector Machine (SVM) and K-Means clustering, with several combinations of features. Our best model (SVM with a non-linear radial-basis function (RBF) kernel) achieved an overall accuracy score of 52.98% and an F1 score (macro) of 0.3324.

Introduction

Sentiment analysis or opinion mining is the task of automatically analyzing text documents using computational methods to obtain the opinions of their authors about specific entities, such as people, companies, events or products. At present, the web has become an excellent source of opinions about entities, particularly with the increased popularity of social media. People express their opinions through reviews, forum discussions, blogs, tweets, comments and posts, and individuals and organizations are increasingly using these opinions for decision-making purposes. We therefore took the initiative of developing and annotating a Bangla text corpus for fine-grained emotion analysis. The term
Emotion Analysis is used because, instead of dividing the corpus based only on positive and negative sentiments, we consider more fine-grained emotion labels such as sadness, happiness, disgust, surprise, fear and anger, which are, according to Paul Ekman (1999), the six basic emotion categories. Next, we implemented five different classical machine learning algorithms, namely Naïve Bayes, Decision Tree, k-Nearest Neighbours, Support Vector Machine and K-Means clustering, on our corpus. The contributions of this paper are thus three-fold:

1. We present a manually annotated Bangla emotion corpus, which incorporates the diversity of fine-grained emotion expressions in social-media text.
2. We employ classical machine-learning approaches that typically perform well in classifying the six aforementioned emotion types.
3. We compare the machine-learning classifiers' performance with a baseline to identify the best-performing model for fine-grained emotion classification.

Using our own carefully curated gold-standard corpus, we report our preliminary efforts to train and evaluate machine learning models for emotion classification in Bangla text. Our experimental results show that a non-linear SVM achieved the best performance among all the tested classifiers, with an accuracy score of 0.5298 and an F-score of 0.3324 (macro).
Related Work

To our knowledge, the reliable literature on fine-grained emotion tagging for Bangla is very limited. One notable work is that of Das and Bandyopadhyay (2010b). In their work, they annotated a random collection of 123 blog posts consisting of a total of 12,149 sentences. The task mainly focused on observing the performance of different machine learning classifiers; on a small subset of 200 test sentences, they compared the average accuracy of a Conditional Random Field (CRF) against an SVM.

In a different paper (Das and Bandyopadhyay, 2010a), the authors described the preparation of the Bengali WordNet Affect, containing six types of emotion words. They employed an automatic method of sense disambiguation. The Bengali WordNet Affect could be useful for emotion-related language processing tasks in Bengali.

In his paper, Das (2011) delineates the identification of emotional content at the document level along with the associated holders and topics. Additionally, he manually annotated a small corpus and, by applying sense-based affect-estimation techniques, reported micro F-scores for 'emotion holder' and 'emotion topic' identification.

In a case study for Bengali (Das et al., 2012), the authors considered 1,100 sentences on eight different topics. They prepared a knowledge base for emoticons and also employed a morphological analyzer to identify the lexical keywords from the Bengali WordNet Affect lists, reporting overall precision, recall and F1-score (micro).

Liew and Turtle (2016) investigated several prevalent machine learning techniques on coarse-grained emotions for English. They used the grounded-theory method to construct a corpus of 5,553 tweets, manually annotated with 28 emotion categories. They showed that SVM and BayesNet outperformed all the other classifiers: BayesNet correctly predicted roughly 60% of the instances, whereas the SVM was correct on 50% of the cases.
We used two different datasets in our experiment. The first was the Part-of-Speech (POS) Tagset: Bengali (Dandapat et al., 2009), used for POS tagging. The version that we were able to obtain contains approximately 3K sentences and 42K words in its original form, annotated with a broad set of 32 tags (https://github.com/abhishekgupta92/bangla_pos_tagger/tree/master/data).

For the task of Bangla emotion classification, we annotated 6,314 comments from three different Facebook groups. These comments were mostly reactions to ongoing socio-political issues and concerned the successes and failures of the Bangladesh government.
For the purpose of our experiment, we took a balanced subset of the aforementioned data and divided it into a training set and a test set at a ratio of 5:1 for training and evaluation purposes. Table 1 summarizes this distribution.
Labels      Training Set    Testing Set
sad             1000            200
happy           1500            300
disgust          500            100
surprise         400             80
fear             300             60
angry           1000            200
Total           4700            940
Table 1: Distribution of emotion classes in the dataset.
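The 5:1 split described above can be sketched with scikit-learn's stratified splitting, so that the per-class proportions of Table 1 are preserved in both partitions. The `texts`/`labels` lists below are illustrative stand-ins for the annotated corpus, not the released data.

```python
# Sketch of a stratified 5:1 train/test split, assuming the annotated
# comments are held as parallel lists `texts` and `labels` (placeholder data).
from sklearn.model_selection import train_test_split

texts = ["comment %d" % i for i in range(60)]
labels = (["sad"] * 10 + ["happy"] * 10 + ["disgust"] * 10
          + ["surprise"] * 10 + ["fear"] * 10 + ["angry"] * 10)

# test_size=1/6 yields the 5:1 training-to-testing ratio; stratify keeps
# the per-class proportions intact in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=1 / 6, stratify=labels, random_state=0)

print(len(X_train), len(X_test))  # → 50 10
```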
In this section, we describe the preprocessing and feature-selection techniques, including the POS-tagging approach that we considered for the emotion recognition models. Finally, the baseline setting for evaluation and further model optimization is introduced.
Apart from cleaning the data, we also applied certain simple text preprocessing techniques. We tokenized the words using a specialized tokenizer for Bangla from spaCy (Honnibal and Montani, 2017). Moreover, we experimented with filtering out stop words.

We explored two types of feature vectors, namely a count vectorizer and a tf-idf vectorizer, each with a combination of n-grams (ranging from unigrams to trigrams), from scikit-learn (Pedregosa et al., 2011). Furthermore, we investigated the effect of POS tagging for feature reduction on our best model.

POS Tagging
For POS tagging, we implemented a Hidden Markov Model (HMM) based tagger. The original POS tagger is capable of assigning all 32 tags over the Bangla dataset (Section 3). After examining several combinations, we retained only the five tags ('JJ', 'CX', 'VM', 'NP' and 'AMN') that were the most significant for emotion-related words.

As the baseline measure, we used a k-NN classifier with word unigrams plus counts as features. The number of nearest neighbours was set to 15 (k=15). For evaluation, we compare the results of this baseline model with the optimized model for each of the classifiers in Section 5.
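The baseline setting above can be sketched as a scikit-learn pipeline: word-unigram count features feeding a k-NN classifier with k=15. The toy training texts are stand-ins for the annotated Bangla comments.

```python
# Minimal sketch of the baseline: k-NN (k=15) over word-unigram counts.
# Training data below is illustrative, not from the corpus.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

X_train = ["good day", "bad day", "so scared", "great news"] * 5
y_train = ["happy", "sad", "fear", "happy"] * 5

baseline = make_pipeline(
    CountVectorizer(ngram_range=(1, 1)),   # word unigrams + raw counts
    KNeighborsClassifier(n_neighbors=15),  # k = 15, as in the baseline
)
baseline.fit(X_train, y_train)
print(baseline.predict(["great day"])[0])  # → happy
```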
For the evaluation, we first present the results of our baseline classifier (k-NN). Then, we attempt to find the best model based on the results reported in Sections 5.1 through 5.6.
Table 2 lists in detail the results of the baselinemodel, whereas Table 3 summarizes the overallaccuracy and the F1(macro) score.
Labels      Precision    Recall    F1(micro)
angry         0.125      0.020      0.034
disgust       1.000      0.010      0.020
fear          0.000      0.000      0.000
happy
sad           0.421      0.040      0.073
surprise      0.125      0.037      0.058
Average

Table 2: Results of the k-NN classifier as the baseline model with k=15.

Evaluation Metric    Score
Accuracy
F1(macro)

Table 3: Average scores of the baseline model.
From Table 2 and Fig. 1, it may be observed that the baseline classifier predicts almost every class as 'happy'. This could be the result of the classifier being biased towards this particular label, because the largest number of training examples was supplied for the category 'happy'.

Figure 1: Confusion Matrix (Accuracy) for the Baseline.
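A normalized confusion matrix of the kind shown in Fig. 1 can be computed with scikit-learn; the label arrays below are illustrative, not the paper's predictions.

```python
# Sketch of a row-normalized confusion matrix (placeholder labels).
from sklearn.metrics import confusion_matrix

y_true = ["happy", "sad", "fear", "happy", "sad", "happy"]
y_pred = ["happy", "happy", "happy", "happy", "sad", "happy"]
labels = ["fear", "happy", "sad"]

# normalize="true" divides each row by its class total, so the diagonal
# cells read as per-class recall (the accuracies plotted in Fig. 1).
cm = confusion_matrix(y_true, y_pred, labels=labels, normalize="true")
print(cm.round(2))
```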
Although the baseline k-NN model performed quite poorly, we attempted to tune the parameters of the k-NN classifier to identify the best k-value for our data. Table 4 and Fig. 2 present the results of the k-NN classifier for various k-values. It should be noted that here we only considered the tf-idf feature, because it yielded better results than the count feature. Considering the data and the plot (Fig. 2), the classifier produces the best outputs for k=5.

K-Values    Accuracy    F1(macro)

Table 4: Results of the k-NN classifier with tf-idf features for different k-values.

Figure 2: Plot of accuracy and F1-score (macro) for different values of k.

We selected the value of k=5 as our default parameter to be further examined with different preprocessing and feature combinations (Table 5). The results indicate that the best k-NN model uses tf-idf unigrams as features (accuracy = 0.479, F1-macro = 0.318).

Feature                              Accuracy    F1(macro)
unigram + count                        0.359       0.172
unigram + tf-idf                       0.479       0.318
stopword + tf-idf                      0.332       0.133
stopword + count                       0.342       0.146
stopword + tf-idf + n-gram(1,3)        0.316       0.091
stopword + count + n-gram(1,3)         0.330       0.114
Table 5: Results of the k-NN classifier for different feature combinations (k=5).
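The k-value sweep behind Table 4 can be sketched as a loop that refits the tf-idf k-NN pipeline for several k and records accuracy and macro-F1 on held-out data. All data below is synthetic; the scores it produces are not the paper's.

```python
# Sketch of a k-value sweep for k-NN over tf-idf features (synthetic data).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, f1_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

X_train = ["good day", "bad day", "so scared", "great news"] * 10
y_train = ["happy", "sad", "fear", "happy"] * 10
X_test, y_test = ["bad news", "good news"], ["sad", "happy"]

scores = {}
for k in (1, 5, 15):
    model = make_pipeline(TfidfVectorizer(),
                          KNeighborsClassifier(n_neighbors=k))
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    scores[k] = (accuracy_score(y_test, pred),
                 f1_score(y_test, pred, average="macro"))

print(scores)
```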
For our second classification algorithm, we used the Multinomial Naive Bayes (MNB) classifier. Unlike certain other classifiers, the MNB did not require setting and tuning parameters. Thus, we directly experimented with different feature/preprocessing techniques (Table 6).
Feature                              Accuracy    F1(macro)
unigram + tf-idf                       0.491       0.266
unigram + count                        0.525       0.295
stopword + tf-idf                      0.472       0.250
stopword + count                       0.506       0.284
n-gram(1,3) + tf-idf                   0.444       0.227
n-gram(1,3) + count                    0.516       0.287
stopword + tf-idf + n-gram(1,3)        0.434       0.219
stopword + count + n-gram(1,3)         0.515       0.292
Table 6: Results of the Multinomial Naïve Bayes classifier for different feature combinations.
Based on the above results, the best MNB model was achieved by combining the count with the unigram feature; an accuracy score of 0.525 and an F1-macro of 0.295 were obtained during the test.
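The MNB setup can be sketched as count features over word unigrams feeding `MultinomialNB`, which indeed exposes no kernel- or neighbour-style hyperparameters to tune. The training data below is a synthetic stand-in for the Bangla comments.

```python
# Sketch of the Multinomial Naive Bayes setup: unigram counts + MNB.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

X_train = ["good happy day", "sad bad day", "so scared now"] * 4
y_train = ["happy", "sad", "fear"] * 4

mnb = make_pipeline(CountVectorizer(), MultinomialNB())
mnb.fit(X_train, y_train)
print(mnb.predict(["happy good news"])[0])  # → happy
```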
The DT constructs a regression or classification model by following a tree structure. Here, the optimal parameter settings were found by tuning the minimum samples split and the minimum sample leaf size. We did not impose any restrictions on the number of features or on the depth of the tree. Table 7 lists the results for several combinations of features and preprocessing schemes.

Feature                              Accuracy    F1(macro)
unigram + tf-idf                       0.442       0.301
unigram + count                        0.432       0.287
stopword + tf-idf                      0.416       0.283
stopword + count                       0.430       0.292
stopword + tf-idf + n-gram(1,3)        0.394       0.247
stopword + count + n-gram(1,3)         0.421       0.277
Table 7: Results of the Decision Tree classifier for different feature combinations.
According to the aforementioned results, the best DT model, with an accuracy of 0.442 and an F1(macro) of 0.301, was obtained from the unigram and tf-idf combination.
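A minimal sketch of the Decision Tree setting: tf-idf unigrams feeding a `DecisionTreeClassifier` with no cap on depth or feature count, as described above. The `min_samples_split`/`min_samples_leaf` values here are illustrative, since the tuned values are not reported in the text, and the data is synthetic.

```python
# Sketch of the DT classifier over tf-idf features (placeholder data
# and placeholder min_samples_* values).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

X_train = ["good happy day", "sad bad day", "so scared now"] * 4
y_train = ["happy", "sad", "fear"] * 4

dt = make_pipeline(
    TfidfVectorizer(),
    DecisionTreeClassifier(min_samples_split=2, min_samples_leaf=1,
                           max_depth=None, max_features=None,
                           random_state=0),
)
dt.fit(X_train, y_train)
# With unrestricted depth the tree separates this toy data perfectly.
print(dt.score(X_train, y_train))  # → 1.0
```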
The only unsupervised machine learning approach that we used in our experiment was K-Means clustering. We selected a cluster size of N=6, as we have six different emotion categories. We investigated different numbers of initializations, ranging from 1 to 15. Here, the number of initializations (n_init) is the number of times the k-means algorithm executes with different centroid seeds; the final result is the best output of these consecutive runs. To evaluate the clustering, we used two measures: the Adjusted Rand Index and the V-measure (similar to the F-measure).
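The clustering setup above can be sketched as K-Means with six clusters and a chosen `n_init`, scored against the gold labels with scikit-learn's Adjusted Rand Index and V-measure. The feature matrix and labels below are synthetic placeholders.

```python
# Sketch of K-Means (six clusters) evaluated with ARI and V-measure
# against gold labels (synthetic features and labels).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, v_measure_score

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))        # stand-in for the text feature vectors
gold = rng.integers(0, 6, size=60)  # six emotion categories

km = KMeans(n_clusters=6, n_init=10, random_state=0)
pred = km.fit_predict(X)

print(round(adjusted_rand_score(gold, pred), 3),
      round(v_measure_score(gold, pred), 3))
```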
Feature                 Adjusted Rand    V-measure
unigram + tf-idf            0.008          0.042
unigram + count             0.009          0.009
n-gram(1,3) + tf-idf        0.059          0.049
n-gram(1,3) + count         0.009          0.011
Table 8: Results of the K-Means clustering algorithm for different feature combinations.
Table 8 lists the best evaluation scores for the k-means clustering model (for n_init of 1 to 15). From the results, we can see that the highest scores, a V-measure of 0.049 and an Adjusted Rand Index of 0.059, were achieved using a combination of n-gram(1,3) and tf-idf features.

Support Vector Machine
To find the best SVM model, we used both linear and non-linear SVM kernels. In both cases, the most important words (tf-idf) were used as features, because they lead to the highest performance. We explored different values for the Gamma and C parameters and found that the non-linear kernel performed slightly better. The results and settings for the SVM model are discussed in Section 5.7.
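The search described above can be sketched as a grid search over C and gamma for an RBF-kernel SVC on tf-idf word features. The data and grid values below are placeholders; the paper does not report its full grid.

```python
# Sketch of a C/gamma grid search for an RBF-kernel SVM over tf-idf
# features (synthetic data; illustrative grid values).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X_train = ["good happy day", "sad bad day", "so scared now",
           "great fun time", "very bad news", "really scared"] * 5
y_train = ["happy", "sad", "fear"] * 10

pipe = Pipeline([("tfidf", TfidfVectorizer()),
                 ("svm", SVC(kernel="rbf"))])
grid = GridSearchCV(pipe,
                    {"svm__C": [0.1, 1, 10],
                     "svm__gamma": [0.3, 0.6, 0.8]},
                    cv=3, scoring="f1_macro")
grid.fit(X_train, y_train)
print(grid.best_params_)
```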
Among all models, the best model was the SVM with a non-linear RBF kernel. We therefore continued experimenting with different preprocessing and feature combinations using this SVM model. Table 9 gives an overview of different combinations of feature and preprocessing techniques on the non-linear SVM model with an RBF kernel. The optimal parameter setting in this experiment used a Gamma value of 0.6, and the best combination of features was the most important (tf-idf) word unigrams. Therefore, the highest accuracy score achieved by the model was 0.5298 (an improvement over the baseline model), with an F1(macro) of 0.3324 (i.e., an improvement of 0.2174 over the baseline model).

Feature                                  Gamma    Accuracy    F1(macro)
POS + unigram + tf-idf                    0.6       0.399       0.226
unigram + tf-idf                          0.6       0.5298      0.3324
unigram + stopword + tf-idf               0.3       0.517       0.312
n-gram(1,3) + tf-idf                      0.8       0.516       0.307
n-gram(1,3) + tf-idf + POS                0.8       0.399       0.224
n-gram(1,3) + stopword + tf-idf           0.4       0.525       0.313
n-gram(1,3) + stopword + tf-idf + POS     0.4       0.374       0.186
Feature Union                             0.6       0.322       0.087
Table 9: Results of the best model for different features.
Based on this best combination of features listed in Table 9, the detailed results are presented in Table 10. A clearer insight into the model's predictions and misclassifications can be obtained from the confusion matrix illustrated in Fig. 3.
Labels      Precision    Recall    F1(micro)
angry         0.547      0.585      0.565
disgust       0.136      0.030      0.049
fear          0.143      0.017      0.030
happy         0.645      0.873      0.742
sad           0.425      0.535      0.473
surprise      0.205      0.100      0.134
Average

Table 10: Detailed results of the best model with unigram and tf-idf as features.

Figure 3: Confusion Matrix (Accuracy) for the best SVM model with unigram and tf-idf as features.
Conclusion

The linguistic motivation behind this project was inspired by the growing field of computational research on natural languages, particularly Bangla language processing, because Bangla is one of the most widely spoken languages: it ranks 7th in the world, with a staggering 268 million native speakers. The computational motivation was to compare the contribution of different features to the performance of a classifier for fine-grained Bangla emotion analysis. The findings of this study imply that the SVM model that best predicted the aforementioned emotions in Bangla social-media text used a non-linear RBF kernel, yielding an accuracy of 52.98% and an F1-score of 0.3324 (macro). These scores show a significant improvement over the baseline model, with the F1 (macro) score increasing by 0.2174.

References

Sandipan Dandapat, Priyanka Biswas, Monojit Choudhury, and Kalika Bali. 2009. Complex linguistic annotation—no easy way out!: A case from Bangla and Hindi POS labeling tasks. In Proceedings of the Third Linguistic Annotation Workshop, pages 10–18. Association for Computational Linguistics.

Dipankar Das. 2011. Analysis and tracking of emotions in English and Bengali texts: A computational approach. International World Wide Web Conference (IW3C2).

Dipankar Das and Sivaji Bandyopadhyay. 2010a. Developing Bengali WordNet Affect for analyzing emotion. In International Conference on the Computer Processing of Oriental Languages, pages 35–40.

Dipankar Das and Sivaji Bandyopadhyay. 2010b. Labeling emotion in Bengali blog corpus–a fine grained tagging at sentence level. In Proceedings of the 8th Workshop on Asian Language Resources, page 47.

Dipankar Das, Sagnik Roy, and Sivaji Bandyopadhyay. 2012. Emotion tracking on blogs–a case study for Bengali. In International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems, pages 447–456. Springer.

Paul Ekman. 1999. Basic emotions. In T. Dalgleish and T. Power (Eds.), The Handbook of Cognition and Emotion, pages 45–60.

Matthew Honnibal and Ines Montani. 2017. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear.

Jasy Suet Yan Liew and Howard R. Turtle. 2016. Exploring fine-grained emotion detection in tweets. In Proceedings of the NAACL Student Research Workshop, pages 73–80.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python.