Analysis of Emotional Content in Indian Political Speeches
Sharu Goel, Sandeep Kumar Pandey, Hanumant Singh Shekhawat, Indian Institute of Technology Guwahati, India
[email protected], [email protected], [email protected]
Abstract
Emotions play an essential role in public speaking, and the emotional content of speech has the power to influence minds. As such, we present an analysis of the emotional content of politicians' speech in the Indian political scenario. We investigate the emotional content present in the speeches of politicians using an attention-based CNN+LSTM network. Experimental evaluations on a dataset of eight Indian politicians show how politicians incorporate emotions in their speeches to strike a chord with the masses. An analysis of the vote share received, along with the victory margin, and their relation to the emotional content of the politicians' speech is also presented.
Index Terms: Speech Emotion Recognition, Politician speech, computational paralinguistics, CNN+LSTM, Attention
1. Introduction
The speech signal carries information on two levels. On the first level, it contains the message to be conveyed; on the second level, it carries information such as speaker identity, gender, and emotion. Identifying the emotional state of a speech utterance is important for making human-computer interaction sound more natural. Emotions also play a significant role in public speaking, such as when a politician addresses a crowd. Speeches offer politicians an opportunity to set the agenda, signal their policy preferences, and, among other things, strike an emotional chord with the public. Emotional speeches allow the speaker to connect better with the masses, instill faith in them, and attract a strong response. As such, it becomes interesting to analyze which emotions, and in what proportion, politicians incorporate in their speeches.

While voting for a particular politician depends on several factors, such as background, work done in the previous tenure, and public interaction, the study in [1] suggests that emotions or affects have a negative impact on the voting decisions of the public. An experiment performed at the University of Massachusetts Amherst revealed that under the emotional state of anger, we are less likely to systematically seek information about a candidate and instead rely more on stereotypes and other preconceived notions and heuristics [2]. Another independent study revealed that anger promotes the propensity to become politically active and hence has the normatively desirable consequence of an active electorate [3]. Furthermore, a study on the 2008 Presidential Elections in the United States showed that negative emotions like anger and outrage have a substantial effect on mobilizing the electorate, whereas positive emotions like hope, enthusiasm, and happiness have a weaker mobilizing effect.
The researchers excluded emotions like sadness and anxiety from their study, as these emotions did not have a noticeable influence on voting behavior [4]. The results of these experiments and studies were corroborated by the outcome of the 2016 Presidential Elections in the United States, where the current president, Donald Trump, ran campaign advertisements more dominated by anger and hope than those of his opponent, Hillary Clinton [5].

Nevertheless, a question remains: do emotional manipulations elicit the expected emotions? Does an angry speech elicit anger or some other emotion? A study on the structure of emotions [3], in which participants were randomly assigned to view one of four campaign ads designed to elicit a discrete emotion, revealed that although there is heterogeneity in the emotional reactions felt in response to campaign messages, the emotional manipulations did elicit the expected emotions. Sadness was most prevalent in the sadness condition relative to all other conditions, anger was highest in the anger condition, and enthusiasm was significantly greater in the enthusiasm condition. Nonetheless, the study also illustrated that different emotional manipulations can be effective in eliciting the same emotion. For instance, sadness was felt in response to both the angry and sad manipulations, and anger was felt in response to the sad manipulation as well (although anger was more prevalent in the angry manipulation, as stated earlier).

Moreover, due to recent advancements in deep learning, Speech Emotion Recognition (SER) has seen significant improvements in performance. In [6], an end-to-end deep convolutional recurrent neural network is used to predict arousal and valence in continuous speech, utilizing two separate sets of 1D CNN layers to extract complementary information.
Also, in [7], feature maps from both time- and frequency-convolution sub-networks are concatenated, and class-based and class-agnostic attention pooling is utilized to generate attentive deep features for SER; experiments on the IEMOCAP dataset [8] show improvement over the baseline. Researchers have also explored extracting discriminative features from raw speech itself. In [9], a comparison of the effectiveness of traditional features versus end-to-end learning in atypical affect and crying recognition is presented, concluding that there is no clear winner. The works in [10] and [11] have also utilized raw speech for emotion classification.

In this work, we apply a deep learning-based architecture to the task of emotional analysis of politicians' speech, particularly in the Indian political context. We present a dataset of speech utterances collected from the speeches of eight Indian politicians, and a widely used Attentive CNN+LSTM architecture is applied to the task of Speech Emotion Recognition (SER). At the first level, the speech utterances are classified into four emotion categories (Angry, Happy, Neutral, and Sad) using the trained emotion model. At the second level, an analysis of the amount of each emotion present in a politician's speech is presented, which gives a first-hand view of the emotional impact of speeches on listeners and their voting decisions. To the best of our knowledge, this is the first work analyzing the emotional content of speeches of Indian politicians.

Figure 1: Attentive CNN+LSTM architecture

The remainder of the paper is organized as follows: Section 2 describes the data collection strategy and the dataset, along with its perceptual evaluation. Section 3 briefly describes the Attentive CNN+LSTM architecture. The experimental setup and results are discussed in Section 4. Section 5 summarizes the findings and concludes the paper.
2. Dataset Description
Since no speech corpus exists that suits our research in the Indian political context, we present the IITG Politician Speech Corpus. It consists of 516 audio segments from speeches delivered by Indian politicians in Hindi. The speeches are publicly available on YouTube on the channels of the respective political groups to which the politicians belong. Eight Indian politicians, whom we nicknamed for anonymization, namely NI (88 utterances), AH (101 utterances), AL (81 utterances), JG (20 utterances), MH (8 utterances), RI (31 utterances), RH (122 utterances), and SY (65 utterances), were considered, as they belong to diverse ideologies and exhibit diverse speaking styles. The politicians were chosen based on contemporary popularity and the wide availability of speeches. The speeches were addressed in both indoor (halls or auditoriums) and outdoor (complex grounds, courts, or stadiums) environments and carried substantial background noise. Only audio output from a single channel was considered.

The audio extracted from the video clips is first segmented into utterances of length 7-15 seconds and classified into four basic emotions: Angry, Happy, Neutral, and Sad. Perceptual evaluation of the utterances was performed by four evaluators. For correctly labelling the speech utterances, we followed a two-step strategy. The first step was to identify whether there is any emotional information in the utterance, irrespective of the underlying message. This is done by assigning each utterance a score on a scale of one to five (one meaning not confident and five meaning extremely confident) based on how confidently the evaluators were able to assign it an emotion. The second step was to assign a category label to each utterance. Since the speeches exhibit natural emotions, the same utterance may be perceived differently by different evaluators. We therefore followed this strategy for assigning a categorical label:
Table 1: Emotion-wise distribution of utterances along with the confidence score for each emotion class, generated via perceptual evaluation

Emotion   No. of Samples   Confidence Score (1-5)   Confidence %
Angry     175              3.5519                   71.0381
Happy     36               3.2152                   64.3055
Neutral   230              3.8199                   76.3991
Sad       75               3.7688                   75.3778
1. If the majority of listeners are in favor of one emotion, the audio clip is assigned that emotion as its label. The confidence score of the label is the mean score of the majority.
2. If two listeners are in favor of one emotion and two in favor of another, the emotion with the higher mean confidence score is assigned as the label. The confidence score of the label is the same mean used to break the tie.
3. If no consensus is achieved for an utterance, i.e., all listeners assigned a different label, that audio clip is discarded.

The number of utterances in each emotional category, along with the evaluators' confidence scores, is presented in Table 1. The Happy category has the fewest utterances, followed by Sad, which is in line with the literature on the emotional content of politicians' speeches. The classes are thus highly imbalanced, which adds to the difficulty of the task at hand.
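The three labeling rules above can be sketched as a small aggregation routine. The function name and the (emotion, confidence) input format are our own illustration, not part of the paper's tooling:

```python
from collections import defaultdict

def aggregate_label(ratings):
    """Combine per-evaluator (emotion, confidence) ratings for one utterance.

    ratings: list of (emotion, confidence) pairs, confidence on a 1-5 scale.
    Returns (label, mean_confidence), or None if the clip should be discarded.
    """
    by_emotion = defaultdict(list)
    for emotion, conf in ratings:
        by_emotion[emotion].append(conf)

    counts = {e: len(scores) for e, scores in by_emotion.items()}
    top = max(counts.values())

    # Rule 3: no consensus at all (every evaluator chose a different label).
    if top == 1:
        return None

    # Rules 1 and 2: take the most-voted emotion; on a 2-2 tie, the higher
    # mean confidence breaks the tie and also serves as the label's score.
    tied = [e for e, n in counts.items() if n == top]
    label = max(tied, key=lambda e: sum(by_emotion[e]) / len(by_emotion[e]))
    scores = by_emotion[label]
    return label, sum(scores) / len(scores)
```

For example, with ratings [("Angry", 4), ("Angry", 3), ("Angry", 2), ("Happy", 5)], rule 1 yields the label "Angry" with confidence 3.0, the mean over the majority's scores.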
3. Architecture Description
The deep learning architecture used for classifying the emotional states of the utterances is the state-of-the-art CNN+LSTM with an attention mechanism [12], [13], as shown in Figure 1. The architecture consists of four local feature learning blocks (LFLBs), one LSTM layer for capturing global dependencies, and one attention layer to focus on the emotion-relevant parts of the utterance, followed by a fully-connected softmax layer to generate class probabilities.

Figure 2: Confusion matrix of the Attentive CNN+LSTM architecture, where the rows represent the confusion of the ground-truth emotion during prediction.

Each LFLB comprises one convolutional layer, one batch-normalization layer, one activation layer, and one max-pooling layer. The CNN performs local feature learning using 2D spatial kernels; its local spatial connectivity and shared weights help the convolution layer perform kernel learning. Batch normalization is applied after the convolution layer to normalize the activations of each batch, keeping the mean activation close to zero and the standard deviation close to one. The activation function used is the Exponential Linear Unit (ELU). Unlike many other activation functions, ELU takes negative values too, which pushes the mean of the activations closer to zero, speeding up learning and improving performance [14]. Max pooling makes the feature maps robust to noise and distortion and reduces the number of trainable parameters in subsequent layers by shrinking the feature maps.

Global feature learning is performed by an LSTM layer. The output from the last LFLB is passed to the LSTM layer to learn long-term contextual dependencies. The sequence of high-level representations obtained from the CNN+LSTM is passed to an attention layer, whose job is to focus on the emotion-salient parts of the feature maps, since not all frames contribute equally to the representation of the speech emotion. The attention layer generates an utterance-level representation, obtained as the weighted summation of the high-level sequence with attention weights learned in a trainable fashion [12].
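Written out, the ELU activation and the attention pooling described above take the following forms (the notation is ours; the paper does not give explicit equations):

```latex
% ELU activation; alpha is the usual scale hyperparameter.
f(x) =
\begin{cases}
x, & x > 0 \\
\alpha \, (e^{x} - 1), & x \le 0
\end{cases}

% Attention pooling: a trainable weight vector w scores each LSTM output
% h_t, and the utterance-level representation c is their weighted sum.
\alpha_t = \frac{\exp(w^{\top} h_t)}{\sum_{k} \exp(w^{\top} h_k)},
\qquad
c = \sum_{t} \alpha_t \, h_t
```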
The utterance-level attentive representations are passed to a fully-connected layer and then to a softmax layer, which maps them to the different emotion classes.

The number of convolutional kernels in the first and second LFLBs is 64, and in the third and fourth LFLBs it is 128. The size of the convolutional kernels is ×, with a stride of ×, for all LFLBs. The max-pooling kernel size is × for the first two LFLBs and × for the latter two. The size of the LSTM cells is 128.
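A minimal sketch of this architecture in PyTorch follows. The paper does not name a framework, and the exact convolution, stride, and pooling sizes were lost in extraction, so the 3×3 kernels and 2×2 pools below are assumptions; only the channel plan (64, 64, 128, 128), the LSTM size (128), and the attention pooling come from the text:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LFLB(nn.Module):
    """Local feature learning block: Conv2d -> BatchNorm -> ELU -> MaxPool."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)  # size assumed
        self.bn = nn.BatchNorm2d(out_ch)

    def forward(self, x):
        return F.max_pool2d(F.elu(self.bn(self.conv(x))), 2)  # 2x2 pool assumed

class AttentiveCNNLSTM(nn.Module):
    def __init__(self, n_mels=128, n_classes=4):
        super().__init__()
        # Channel plan from the paper: 64, 64, 128, 128 kernels per LFLB.
        self.lflbs = nn.Sequential(LFLB(1, 64), LFLB(64, 64),
                                   LFLB(64, 128), LFLB(128, 128))
        freq_out = n_mels // 16                   # mel bins left after four 2x pools
        self.lstm = nn.LSTM(128 * freq_out, 128, batch_first=True)
        self.attn = nn.Linear(128, 1)             # trainable attention weights
        self.fc = nn.Linear(128, n_classes)

    def forward(self, x):                         # x: (batch, 1, n_mels, frames)
        h = self.lflbs(x)                         # (batch, 128, n_mels/16, frames/16)
        h = h.permute(0, 3, 1, 2).flatten(2)      # (batch, time, features)
        h, _ = self.lstm(h)
        w = torch.softmax(self.attn(h), dim=1)    # attention over time steps
        utt = (w * h).sum(dim=1)                  # weighted sum -> utterance vector
        return self.fc(utt)                       # logits; softmax lives in the loss
```

A forward pass on a batch of mel-spectrograms of shape (batch, 1, 128, frames) yields (batch, 4) logits over the four emotion classes.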
4. Experimental Evaluation
The experiments are performed in two stages. The first stage generates class probabilities by performing Speech Emotion Recognition (SER) using the architecture discussed in the previous section. The second stage analyzes the duration of the different emotions in each politician's speech.

Figure 3: Distribution of training utterances of the IITG Politician Speech Dataset visualized using t-SNE with mel-spectrograms as input. (a) shows the untrained distribution, (b) the distribution after the LSTM layer, and (c) the distribution after the attention layer.

For input to the Attentive CNN+LSTM architecture, mel-spectrograms are computed from the raw speech files. Since the CNN requires equal-sized inputs, the speech files are either zero-padded or truncated to the same length. A frame size of 25 ms and a frame shift of 10 ms are used to compute the mel-spectrograms. The model is trained in a five-fold cross-validation scheme, and Unweighted Accuracy (UA) is used as the performance measure. The cross-entropy loss function is used together with the Adam optimizer.

Figure 4: Percentage of different emotions present in the speeches of the Indian politicians of the IITG Political Speech Corpus.

The model achieves a UAR of . with the proposed architecture. Figure 2 presents the confusion matrix for the predicted emotions over the four emotion classes of the IITG Political Speech dataset. Due to the small number of samples in the Sad category, recognition performance is degraded for that category. The Happy category is clearly recognized by the model. Figure 3 presents the distribution of the untrained data, the trained data after the LSTM layer, and the trained data after the attention layer.
The plots show the capability of the architecture to cluster emotions, with considerable improvement from the addition of the attention layer.

From the data and statistics of various national and state elections in India [15, 16, 17, 18], we made the following observations. NI had a vote share of 63.60%, with a victory margin of 45.20%; this impressive result can be attributed to his emotionally expressive speeches, whose emotional content is significant in all four categories. RH had a vote share of 56.64%, with a victory margin of 31.07%; this success is backed by the significant anger content of his speeches, with smaller amounts of the other emotions. AH had a vote share of 69.58%, with a victory margin of 43.32%; the massive vote share and margin are credited to his heavily anger-dominated speeches, which compensate for the lack of happy emotion. JG had a vote share of 33.70%, with a victory margin of 8.58%; although substantial anger can be seen in his speeches, the absence of sad emotional content and the poor happiness content could be the reason for his poor vote share and margin. RI had a vote share of 64.64%, with a victory margin of 39.51%; his electoral success is supported by an alloy of angry and sad emotional content in his speeches, while a lesser quantity of anger and a dearth of happy emotion explain the weaker margin. AL had a vote share of 61.10%, with a victory margin of 28.35%; the high vote share can be a result of well-balanced emotional content. SY had a vote share of a meager 1.62%, with a huge defeat margin of 54.39%; an extreme lack of emotion in his oration accounts for this result. MH, with a vote share of 46.25%, lost with a defeat margin of 6.00%.
Although a considerable measure of sad emotion is seen in his speeches, the small vote share can be attributed to the lack of potent emotional factors like anger or happiness.

In summary, emotional content in speeches plays a pivotal role in mobilizing support and winning elections. Among the four emotional categories, anger was found to be the most potent political emotion. In a political scenario, the display of anger in oration makes individuals more optimistic about the future, particularly when voters are frustrated by the absence of desired economic and social outcomes such as rapid economic growth and reductions in poverty, unemployment, corruption, violence, and terrorism. Similarly, happiness is a vital emotion in an electoral context, as it not only curbs fear and anxiety but also offers hope for an improved future and adds to a political candidate's appeal. In the Indian context, sad emotion in speeches may manifest as fear and anxiety among the masses; although sadness has a role, it is of lesser importance than the other emotions. Furthermore, the most effective speeches in a political context are those that mix all the emotions, namely anger, happiness, and sadness, in order of importance.

Our findings are in consonance with earlier studies of electoral campaigns in the US and point towards the universal application of emotions, and their success, in politics. SER can therefore provide a metric to gauge the effectiveness of a politician's speech and the likelihood of electoral success or failure.
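The per-politician emotion percentages underlying this analysis can be computed directly from the model's predicted labels; this helper and its input format are our own illustration:

```python
from collections import Counter

def emotion_profile(predictions):
    """Percentage of each emotion class among one politician's utterances.

    predictions: list of predicted class labels, e.g. output of the
    trained emotion model over all of a politician's utterances.
    """
    counts = Counter(predictions)
    total = sum(counts.values())
    return {emotion: round(100.0 * n / total, 2)
            for emotion, n in counts.items()}
```

For example, four hypothetical utterances labeled three times "Angry" and once "Neutral" yield {"Angry": 75.0, "Neutral": 25.0}; applying this per politician produces the kind of breakdown shown in Figure 4.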
5. Conclusions
In this work, we presented a brief analysis of the different emotions present in the speeches of politicians in the Indian context. A deep learning framework, CNN+LSTM with an attention mechanism, is used to model the emotions using the IITG Politician Speech Dataset. The dataset is collected from publicly available speeches of Indian politicians on YouTube. The speech utterances are segmented and annotated by four annotators, and a confidence score is provided for each emotion category based on the perceptual evaluation. The performance of the model in predicting emotions from speech utterances exceeds that of the perceptual evaluation. An assessment of the percentage of the different emotion classes in each politician's speech is presented, along with a brief analysis of the results in light of previous findings in the literature.
6. References

[1] W. H. Riker and P. C. Ordeshook, "A theory of the calculus of voting," American Political Science Review, vol. 62, no. 1, pp. 25-42, 1968.
[2] M. T. Parker and L. M. Isbell, "How I vote depends on how I feel: The differential impact of anger and fear on political information processing," Psychological Science, vol. 21, no. 4, pp. 548-550, 2010.
[3] C. Weber, "Emotions, campaigns, and political participation," Political Research Quarterly, vol. 66, no. 2, pp. 414-428, 2013.
[4] N. A. Valentino, T. Brader, E. W. Groenendyk, K. Gregorowicz, and V. L. Hutchings, "Election night's alright for fighting: The role of emotions in political participation," The Journal of Politics, vol. 73, no. 1, pp. 156-170, 2011.
[5] K. Searles and T. Ridout, "The use and consequences of emotions in politics," Emotion Researcher, ISRE's Sourcebook for Research on Emotion and Affect, 2017.
[6] Z. Yang and J. Hirschberg, "Predicting arousal and valence from waveforms and spectrograms using deep neural networks," in Interspeech, 2018, pp. 3092-3096.
[7] P. Li, Y. Song, I. V. McLoughlin, W. Guo, and L.-R. Dai, "An attention pooling based representation learning method for speech emotion recognition," 2018.
[8] C. Busso, M. Bulut, C.-C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. N. Chang, S. Lee, and S. S. Narayanan, "IEMOCAP: Interactive emotional dyadic motion capture database," Language Resources and Evaluation, vol. 42, no. 4, p. 335, 2008.
[9] J. Wagner, D. Schiller, A. Seiderer, and E. André, "Deep learning in paralinguistic recognition tasks: Are hand-crafted features still relevant?" in Interspeech, 2018, pp. 147-151.
[10] S. K. Pandey, H. Shekhawat, and S. Prasanna, "Emotion recognition from raw speech using wavenet," in TENCON 2019 - 2019 IEEE Region 10 Conference (TENCON). IEEE, 2019, pp. 1292-1297.
[11] M. Sarma, P. Ghahremani, D. Povey, N. K. Goel, K. K. Sarma, and N. Dehak, "Emotion identification from raw speech signals using DNNs," in Interspeech, 2018, pp. 3097-3101.
[12] M. Chen, X. He, J. Yang, and H. Zhang, "3-D convolutional recurrent neural networks with attention model for speech emotion recognition," IEEE Signal Processing Letters, vol. 25, no. 10, pp. 1440-1444, 2018.
[13] J. Zhao, X. Mao, and L. Chen, "Speech emotion recognition using deep 1D & 2D CNN LSTM networks," Biomedical Signal Processing and Control, vol. 47, pp. 312-323, 2019.
[14] D.-A. Clevert, T. Unterthiner, and S. Hochreiter, "Fast and accurate deep network learning by exponential linear units (ELUs)," arXiv preprint arXiv:1511.07289, 2015.