Comparison of Classical Machine Learning Approaches on Bangla Textual Emotion Analysis
Md. Ataur Rahman
Language Science and Technology, University of Saarland, Saarbrücken, Germany
[email protected]
Md. Hanif Seddiqui
Computer Science and Engineering, University of Chittagong, Chittagong, Bangladesh
[email protected]
Abstract
Detecting emotions from text is an extension of simple sentiment polarity detection. Instead of considering only positive or negative sentiments, emotions are conveyed in a more tangible manner; thus, they can be expressed as many shades of gray. This paper presents the results of our experiments on fine-grained emotion analysis of Bangla text. We gathered and annotated a text corpus consisting of user comments from several Facebook groups regarding socio-economic and political issues, and we made efforts to extract the basic emotions (sadness, happiness, disgust, surprise, fear, anger) conveyed through these comments. Finally, we compared the results of the five most popular classical machine learning techniques, namely Naïve Bayes, Decision Tree, k-Nearest Neighbor (k-NN), Support Vector Machine (SVM) and K-Means clustering, with several combinations of features. Our best model (SVM with a non-linear radial-basis function (RBF) kernel) achieved an overall accuracy score of 52.98% and an F1 score (macro) of 0.3324.

Introduction

Sentiment analysis or opinion mining is the task of automatically analyzing text documents using computational methods to obtain the opinions of their authors about specific entities, such as people, companies, events or products. At present, the web has become an excellent source of opinions about entities, particularly with the increased popularity of social media. People express their opinions through reviews, forum discussions, blogs, tweets, comments and posts, and individuals and organizations are increasingly using these opinions for decision-making purposes. We therefore took the initiative of developing and annotating a Bangla text corpus for fine-grained emotion analysis. The term
Emotion Analysis is used because, instead of dividing the corpus based only on positive and negative sentiments, we consider more fine-grained emotion labels such as sadness, happiness, disgust, surprise, fear and anger, which are, according to Paul Ekman (1999), the six basic emotion categories. Next, we implemented five different classical machine learning algorithms, namely Naïve Bayes, Decision Tree, k-Nearest Neighbours, Support Vector Machine and K-Means clustering, on our corpus. The contributions of this paper are thus three-fold:

1. We present a manually annotated Bangla emotion corpus, which incorporates the diversity of fine-grained emotion expressions in social-media text.
2. We employ classical machine-learning approaches that typically perform well in classifying the six aforementioned emotion types.
3. We compare the machine-learning classifiers' performance with a baseline to identify the best-performing model for fine-grained emotion classification.

Using our own carefully curated gold-standard corpus, we report our preliminary efforts to train and evaluate machine learning models for emotion classification in Bangla text. Our experimental results show that a non-linear SVM achieved the best performance among all the tested classifiers, with an accuracy score of 0.5298 and an F-score of 0.3324 (macro).
Related Work

To our knowledge, the reliable literature on fine-grained emotion tagging for Bangla is very limited. One notable work is that of Das and Bandyopadhyay (2010b). In their work, they annotated a random collection of 123 blog posts consisting of a total of 12,149 sentences. The task mainly focused on observing the performance of different machine learning classifiers; on a small subset of 200 test sentences, they compared the average accuracy of a Conditional Random Field (CRF) against an SVM.

In a different paper (Das and Bandyopadhyay, 2010a), the authors described the preparation of the Bengali WordNet Affect, containing six types of emotion words. They employed an automatic method of sense disambiguation. The Bengali WordNet Affect could be useful for emotion-related language processing tasks in Bengali.

In his paper, Das (2011) delineates the identification of emotional content at the document level along with the associated holders and topics. Additionally, he manually annotated a small corpus and, by applying sense-based affect-estimation techniques, reported micro F-scores for 'emotion holder' and 'emotion topic' identification.

In a case study for Bengali (Das et al., 2012), the authors considered 1,100 sentences on eight different topics. They prepared a knowledge base for emoticons and also employed a morphological analyzer to identify the lexical keywords from the Bengali WordNet Affect lists, reporting overall precision, recall and F1-score (micro).

Liew and Turtle (2016) investigated several prevalent machine learning techniques on coarse-grained emotions for English. They used the grounded-theory method to construct a corpus of 5,553 tweets, manually annotated with 28 emotion categories. They showed that SVM and BayesNet outperformed all the other classifiers: BayesNet correctly predicted roughly 60% of the instances, whereas the SVM was correct on 50% of the cases.
We used two different datasets in our experiment. The first was the Part-of-Speech (POS) Tagset: Bengali (Dandapat et al., 2009), used for POS tagging. The version that we were able to obtain contains approximately 3K sentences and 42K words in its original form, annotated with a broad set of 32 tags (https://github.com/abhishekgupta92/bangla_pos_tagger/tree/master/data).

For the task of Bangla emotion classification, we annotated 6,314 comments from three different Facebook groups. These comments were mostly reactions to ongoing socio-political issues and concerned the successes and failures of the Bangladesh government.
For the purpose of our experiment, we took a balanced subset of the aforementioned data and divided it into a training set and a test set at a ratio of 5:1 for training and evaluation purposes. Table 1 summarizes this distribution.
Labels      Training Set    Testing Set
sad             1000            200
happy           1500            300
disgust          500            100
surprise         400             80
fear             300             60
angry           1000            200
Total           4700            940
Table 1: Distribution of emotion classes in the dataset.
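The 5:1 split described above can be sketched with scikit-learn's stratified splitting, so that the per-class proportions of Table 1 are preserved in both partitions. The `texts`/`labels` lists below are illustrative stand-ins for the annotated corpus, not the released data.

```python
# Sketch of a stratified 5:1 train/test split, assuming the annotated
# comments are held as parallel lists `texts` and `labels` (placeholder data).
from sklearn.model_selection import train_test_split

texts = ["comment %d" % i for i in range(60)]
labels = (["sad"] * 10 + ["happy"] * 10 + ["disgust"] * 10
          + ["surprise"] * 10 + ["fear"] * 10 + ["angry"] * 10)

# test_size=1/6 yields the 5:1 training-to-testing ratio; stratify keeps
# the per-class proportions intact in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=1 / 6, stratify=labels, random_state=0)

print(len(X_train), len(X_test))  # → 50 10
```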
In this section, we describe the preprocessing and feature-selection techniques, including the POS-tagging approach that we considered for the emotion recognition models. Finally, the baseline setting for evaluation and further model optimization is introduced.
Apart from cleaning the data, we also applied certain simple text preprocessing techniques. We tokenized the words using a specialized tokenizer for Bangla from spaCy (Honnibal and Montani, 2017). Moreover, we experimented with filtering out stop words.

We explored two types of feature vectors, namely a count vectorizer and a tf-idf vectorizer, each with a combination of n-grams (ranging from unigrams to trigrams), from scikit-learn (Pedregosa et al., 2011). Furthermore, we investigated the effect of POS tagging for feature reduction on our best model.

POS Tagging
For POS tagging, we implemented a Hidden Markov Model (HMM) based tagger. The original POS tagger is capable of assigning all 32 tags over the Bangla dataset (Section 3). After examining several combinations, we retained only the five tags ('JJ', 'CX', 'VM', 'NP' and 'AMN') that were the most significant for emotion-related words.

As the baseline measure, we used a k-NN classifier with word unigrams plus counts as features. The number of nearest neighbours was set to 15 (k=15). For evaluation, we compare the results of this baseline model with the optimized model for each of the classifiers in Section 5.
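The baseline setting above can be sketched as a scikit-learn pipeline: word-unigram count features feeding a k-NN classifier with k=15. The toy training texts are stand-ins for the annotated Bangla comments.

```python
# Minimal sketch of the baseline: k-NN (k=15) over word-unigram counts.
# Training data below is illustrative, not from the corpus.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

X_train = ["good day", "bad day", "so scared", "great news"] * 5
y_train = ["happy", "sad", "fear", "happy"] * 5

baseline = make_pipeline(
    CountVectorizer(ngram_range=(1, 1)),   # word unigrams + raw counts
    KNeighborsClassifier(n_neighbors=15),  # k = 15, as in the baseline
)
baseline.fit(X_train, y_train)
print(baseline.predict(["great day"])[0])  # → happy
```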
For the evaluation, we first present the results of our baseline classifier (k-NN). Then, we attempt to find the best model based on the results reported in Sections 5.1 through 5.6.
Table 2 lists in detail the results of the baselinemodel, whereas Table 3 summarizes the overallaccuracy and the F1(macro) score.
Labels      Precision    Recall    F1(micro)
angry         0.125      0.020      0.034
disgust       1.000      0.010      0.020
fear          0.000      0.000      0.000
happy
sad           0.421      0.040      0.073
surprise      0.125      0.037      0.058
Average

Table 2: Results of the k-NN classifier as the baseline model with k=15.

Evaluation Metric    Score
Accuracy
F1(macro)

Table 3: Average scores of the baseline model.
From Table 2 and Fig. 1, it may be observed that the baseline classifier predicts almost every class as 'happy'. This could be the result of the classifier being biased towards this particular label, because the largest number of training examples was supplied for the category 'happy'.

Figure 1: Confusion Matrix (Accuracy) for the Baseline.
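A normalized confusion matrix of the kind shown in Fig. 1 can be computed with scikit-learn; the label arrays below are illustrative, not the paper's predictions.

```python
# Sketch of a row-normalized confusion matrix (placeholder labels).
from sklearn.metrics import confusion_matrix

y_true = ["happy", "sad", "fear", "happy", "sad", "happy"]
y_pred = ["happy", "happy", "happy", "happy", "sad", "happy"]
labels = ["fear", "happy", "sad"]

# normalize="true" divides each row by its class total, so the diagonal
# cells read as per-class recall (the accuracies plotted in Fig. 1).
cm = confusion_matrix(y_true, y_pred, labels=labels, normalize="true")
print(cm.round(2))
```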
Although the baseline k-NN model performed quite poorly, we attempted to tune the parameters of the k-NN classifier to identify the best k-value for our data. Table 4 and Fig. 2 present the results of the k-NN classifier for various k-values. It should be noted that here we only considered the tf-idf feature, because it yielded better results than the count feature. Considering the data and the plot (Fig. 2), the classifier produces the best outputs for k=5.

K-Values    Accuracy    F1(macro)

Table 4: Results of the k-NN classifier with tf-idf features for different k-values.

Figure 2: Plot of accuracy and F1-score (macro) for different values of k.

We selected the value of k=5 as our default parameter to be further examined with different preprocessing and feature combinations (Table 5). The results indicate that the best k-NN model uses tf-idf unigrams as features (accuracy = 0.479, F1-macro = 0.318).

Feature                              Accuracy    F1(macro)
unigram + count                        0.359       0.172
unigram + tf-idf                       0.479       0.318
stopword + tf-idf                      0.332       0.133
stopword + count                       0.342       0.146
stopword + tf-idf + n-gram(1,3)        0.316       0.091
stopword + count + n-gram(1,3)         0.330       0.114
Table 5: Results of the k-NN classifier for different feature combinations (k=5).
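The k-value sweep behind Table 4 can be sketched as a loop that refits the tf-idf k-NN pipeline for several k and records accuracy and macro-F1 on held-out data. All data below is synthetic; the scores it produces are not the paper's.

```python
# Sketch of a k-value sweep for k-NN over tf-idf features (synthetic data).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, f1_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

X_train = ["good day", "bad day", "so scared", "great news"] * 10
y_train = ["happy", "sad", "fear", "happy"] * 10
X_test, y_test = ["bad news", "good news"], ["sad", "happy"]

scores = {}
for k in (1, 5, 15):
    model = make_pipeline(TfidfVectorizer(),
                          KNeighborsClassifier(n_neighbors=k))
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    scores[k] = (accuracy_score(y_test, pred),
                 f1_score(y_test, pred, average="macro"))

print(scores)
```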
For our second classification algorithm, we used the Multinomial Naive Bayes (MNB) classifier. Unlike certain other classifiers, the MNB did not require setting and tuning parameters. Thus, we directly experimented with different feature/preprocessing techniques (Table 6).
Feature                              Accuracy    F1(macro)
unigram + tf-idf                       0.491       0.266
unigram + count                        0.525       0.295
stopword + tf-idf                      0.472       0.250
stopword + count                       0.506       0.284
n-gram(1,3) + tf-idf                   0.444       0.227
n-gram(1,3) + count                    0.516       0.287
stopword + tf-idf + n-gram(1,3)        0.434       0.219
stopword + count + n-gram(1,3)         0.515       0.292
Table 6: Results of the Multinomial Naïve Bayes classifier for different feature combinations.
Based on the above results, the best MNB model was achieved by combining the count with the unigram feature; an accuracy score of 0.525 and an F1-macro of 0.295 were obtained during the test.
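The MNB setup can be sketched as count features over word unigrams feeding `MultinomialNB`, which indeed exposes no kernel- or neighbour-style hyperparameters to tune. The training data below is a synthetic stand-in for the Bangla comments.

```python
# Sketch of the Multinomial Naive Bayes setup: unigram counts + MNB.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

X_train = ["good happy day", "sad bad day", "so scared now"] * 4
y_train = ["happy", "sad", "fear"] * 4

mnb = make_pipeline(CountVectorizer(), MultinomialNB())
mnb.fit(X_train, y_train)
print(mnb.predict(["happy good news"])[0])  # → happy
```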
The DT constructs a regression or classification model by following a tree structure. Here, the optimal parameter settings were found by tuning the minimum samples split and the minimum sample leaf size. We did not impose any restrictions on the number of features or on the depth of the tree. Table 7 lists the results for several combinations of features and preprocessing schemes.

Feature                              Accuracy    F1(macro)
unigram + tf-idf                       0.442       0.301
unigram + count                        0.432       0.287
stopword + tf-idf                      0.416       0.283
stopword + count                       0.430       0.292
stopword + tf-idf + n-gram(1,3)        0.394       0.247
stopword + count + n-gram(1,3)         0.421       0.277
Table 7: Results of the Decision Tree classifier for different feature combinations.
According to the aforementioned results, the best DT model, with an accuracy of 0.442 and an F1(macro) of 0.301, was obtained from the unigram and tf-idf combination.
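A minimal sketch of the Decision Tree setting: tf-idf unigrams feeding a `DecisionTreeClassifier` with no cap on depth or feature count, as described above. The `min_samples_split`/`min_samples_leaf` values here are illustrative, since the tuned values are not reported in the text, and the data is synthetic.

```python
# Sketch of the DT classifier over tf-idf features (placeholder data
# and placeholder min_samples_* values).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

X_train = ["good happy day", "sad bad day", "so scared now"] * 4
y_train = ["happy", "sad", "fear"] * 4

dt = make_pipeline(
    TfidfVectorizer(),
    DecisionTreeClassifier(min_samples_split=2, min_samples_leaf=1,
                           max_depth=None, max_features=None,
                           random_state=0),
)
dt.fit(X_train, y_train)
# With unrestricted depth the tree separates this toy data perfectly.
print(dt.score(X_train, y_train))  # → 1.0
```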
The only unsupervised machine learning approach that we used in our experiment was K-Means clustering. We selected a cluster size of N=6, as we have six different emotion categories. We investigated different numbers of initializations, ranging from 1 to 15. Here, the number of initializations (n_init) is the number of times the k-means algorithm executes with different centroid seeds; the final result is the best output of these consecutive runs. To evaluate the clustering, we used two measures: the Adjusted Rand Index and the V-measure (similar to the F-measure).
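The clustering setup above can be sketched as K-Means with six clusters and a chosen `n_init`, scored against the gold labels with scikit-learn's Adjusted Rand Index and V-measure. The feature matrix and labels below are synthetic placeholders.

```python
# Sketch of K-Means (six clusters) evaluated with ARI and V-measure
# against gold labels (synthetic features and labels).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, v_measure_score

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))        # stand-in for the text feature vectors
gold = rng.integers(0, 6, size=60)  # six emotion categories

km = KMeans(n_clusters=6, n_init=10, random_state=0)
pred = km.fit_predict(X)

print(round(adjusted_rand_score(gold, pred), 3),
      round(v_measure_score(gold, pred), 3))
```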
Feature                 Adjusted Rand    V-measure
unigram + tf-idf            0.008          0.042
unigram + count             0.009          0.009
n-gram(1,3) + tf-idf        0.059          0.049
n-gram(1,3) + count         0.009          0.011
Table 8: Results of the K-Means clustering algorithm for different feature combinations.
Table 8 lists the best evaluation scores for the k-means clustering model (for n_init of 1 to 15). From the results, we can see that the highest scores, a V-measure of 0.049 and an Adjusted Rand Index of 0.059, were achieved using a combination of n-gram(1,3) and tf-idf features.

Support Vector Machine
To find the best SVM model, we used both linear and non-linear SVM kernels. In both cases, the most important words (tf-idf) were used as features, because they lead to the highest performance. We explored different values for the Gamma and C parameters and found that the non-linear kernel performed slightly better. The results and settings for the SVM model are discussed in Section 5.7.
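The search described above can be sketched as a grid search over C and gamma for an RBF-kernel SVC on tf-idf word features. The data and grid values below are placeholders; the paper does not report its full grid.

```python
# Sketch of a C/gamma grid search for an RBF-kernel SVM over tf-idf
# features (synthetic data; illustrative grid values).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X_train = ["good happy day", "sad bad day", "so scared now",
           "great fun time", "very bad news", "really scared"] * 5
y_train = ["happy", "sad", "fear"] * 10

pipe = Pipeline([("tfidf", TfidfVectorizer()),
                 ("svm", SVC(kernel="rbf"))])
grid = GridSearchCV(pipe,
                    {"svm__C": [0.1, 1, 10],
                     "svm__gamma": [0.3, 0.6, 0.8]},
                    cv=3, scoring="f1_macro")
grid.fit(X_train, y_train)
print(grid.best_params_)
```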
Among all models, the best model was the SVM with a non-linear RBF kernel. We therefore continued experimenting with different preprocessing and feature combinations using this SVM model. Table 9 gives an overview of different combinations of feature and preprocessing techniques on the non-linear SVM model with an RBF kernel. The optimal parameter setting in this experiment used a Gamma value of 0.6, and the best combination of features was the most important (tf-idf) word unigrams. Therefore, the highest accuracy score achieved by the model was 0.5298 (an improvement over the baseline model), with an F1(macro) of 0.3324 (i.e., an improvement of 0.2174 over the baseline model).

Feature                                  Gamma    Accuracy    F1(macro)
POS + unigram + tf-idf                    0.6       0.399       0.226
unigram + tf-idf                          0.6       0.5298      0.3324
unigram + stopword + tf-idf               0.3       0.517       0.312
n-gram(1,3) + tf-idf                      0.8       0.516       0.307
n-gram(1,3) + tf-idf + POS                0.8       0.399       0.224
n-gram(1,3) + stopword + tf-idf           0.4       0.525       0.313
n-gram(1,3) + stopword + tf-idf + POS     0.4       0.374       0.186
Feature Union                             0.6       0.322       0.087
Table 9: Results of the best model for different features.
Based on this best combination of features listed in Table 9, the detailed results are presented in Table 10. A clearer insight into the model's predictions and misclassifications can be obtained from the confusion matrix illustrated in Fig. 3.
Labels      Precision    Recall    F1(micro)
angry         0.547      0.585      0.565
disgust       0.136      0.030      0.049
fear          0.143      0.017      0.030
happy         0.645      0.873      0.742
sad           0.425      0.535      0.473
surprise      0.205      0.100      0.134
Average

Table 10: Detailed results of the best model with unigram and tf-idf as features.

Figure 3: Confusion Matrix (Accuracy) for the best SVM model with unigram and tf-idf as features.
Conclusion

The linguistic motivation behind this project was inspired by the growing field of computational research on natural languages, particularly Bangla language processing, because Bangla is one of the most widely spoken languages: it ranks 7th in the world, with a staggering 268 million native speakers. The computational motivation was to compare the contribution of different features to the performance of a classifier for fine-grained Bangla emotion analysis. The findings of this study imply that the SVM model that best predicted the aforementioned emotions in Bangla social-media text used a non-linear RBF kernel, yielding an accuracy of 52.98% and an F1-score of 0.3324 (macro). These scores show a significant improvement over the baseline model, with the F1 (macro) score increasing by 0.2174.

References

Sandipan Dandapat, Priyanka Biswas, Monojit Choudhury, and Kalika Bali. 2009. Complex linguistic annotation—no easy way out!: A case from Bangla and Hindi POS labeling tasks. In Proceedings of the Third Linguistic Annotation Workshop, pages 10–18. Association for Computational Linguistics.

Dipankar Das. 2011. Analysis and tracking of emotions in English and Bengali texts: A computational approach. International World Wide Web Conference (IW3C2).

Dipankar Das and Sivaji Bandyopadhyay. 2010a. Developing Bengali WordNet Affect for analyzing emotion. In International Conference on the Computer Processing of Oriental Languages, pages 35–40.

Dipankar Das and Sivaji Bandyopadhyay. 2010b. Labeling emotion in Bengali blog corpus–a fine grained tagging at sentence level. In Proceedings of the 8th Workshop on Asian Language Resources, page 47.

Dipankar Das, Sagnik Roy, and Sivaji Bandyopadhyay. 2012. Emotion tracking on blogs–a case study for Bengali. In International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems, pages 447–456. Springer.

Paul Ekman. 1999. Basic emotions. In T. Dalgleish and T. Power (Eds.), The Handbook of Cognition and Emotion, pages 45–60.

Matthew Honnibal and Ines Montani. 2017. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear.

Jasy Suet Yan Liew and Howard R. Turtle. 2016. Exploring fine-grained emotion detection in tweets. In Proceedings of the NAACL Student Research Workshop, pages 73–80.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python.