Multimodal Sentiment Analysis: Addressing Key Issues and Setting up the Baselines
Soujanya Poria, Navonil Majumder, Devamanyu Hazarika, Erik Cambria, Alexander Gelbukh, Amir Hussain
Abstract—We compile baselines, along with dataset splits, for multimodal sentiment analysis. In this paper, we explore three different deep-learning-based architectures for multimodal sentiment classification, each improving upon the previous. Further, we evaluate these architectures on multiple datasets with fixed train/test partitions. We also discuss some major issues, frequently ignored in multimodal sentiment analysis research, e.g., the role of speaker-exclusive models, the importance of different modalities, and generalizability. This framework illustrates the different facets of analysis to be considered while performing multimodal sentiment analysis and, hence, serves as a new benchmark for future research in this emerging field.
N. Majumder and A. Gelbukh are with the CIC, Instituto Politécnico Nacional, Mexico City, Mexico. S. Poria and E. Cambria are with the SCSE, Nanyang Technological University, Singapore. D. Hazarika is with the SOC, National University of Singapore, Singapore. A. Hussain is with Edinburgh Napier University, UK.

I. INTRODUCTION
Emotion recognition and sentiment analysis are opening up numerous opportunities pertaining to social media, in terms of understanding users' preferences, habits, and content [11]. With the advancement of communication technology, the abundance of mobile devices, and the rapid rise of social media, a large amount of data is being uploaded as video rather than text [2]. For example, consumers tend to record their opinions on products using a webcam and upload them to social media platforms, such as YouTube and Facebook, to inform subscribers of their views. Such videos often contain comparisons of products from competing brands, pros and cons of product specifications, and other information that can aid prospective buyers in making informed decisions.

The primary advantage of analyzing videos over mere text analysis, for detecting emotions and sentiment, is the surplus of behavioral cues. Videos provide multimodal data in terms of vocal and visual modalities. The vocal modulations and the facial expressions in the visual data, along with the text data, provide important cues to better identify the true affective state of the opinion holder. Thus, a combination of text and video data helps to create a better emotion and sentiment analysis model.

Recently, a number of approaches to multimodal sentiment analysis producing interesting results have been proposed [12], [14]. However, there are major issues that remain mostly unaddressed in this field, such as the consideration of context in classification, the effect of speaker-inclusive versus speaker-exclusive scenarios, the impact of each modality across datasets, and the generalization ability of a multimodal sentiment classifier. Not tackling these issues has made it difficult to effectively compare different multimodal sentiment analysis methods. In this paper, we outline some methods that address these issues and set up a baseline based on state-of-the-art methods. We use a deep convolutional neural network (CNN) to extract features from the visual and text modalities.

This paper is organized as follows: Section II provides a brief literature review on multimodal sentiment analysis; Section III briefly discusses the baseline methods; experimental results and discussion are given in Section IV; finally, Section V concludes the paper.
II. RELATED WORK
In 1970, Ekman et al. [6] carried out extensive studies on facial expressions. Their research showed that universal facial expressions are able to provide sufficient clues to detect emotions. Recent studies on speech-based emotion analysis [4] have focused on identifying relevant acoustic features, such as fundamental frequency (pitch), intensity of utterance, bandwidth, and duration.

As to fusing audio and visual modalities for emotion recognition, two of the early works were done by De Silva et al. [5] and Chen et al. [3]. Both works showed that a bimodal system yielded a higher accuracy than any unimodal system.

While there are many research papers on audio-visual fusion for emotion recognition, only a few works have been devoted to multimodal emotion or sentiment analysis using text clues along with the visual and audio modalities. Wöllmer et al. [15] fused information from the audio, visual, and text modalities to extract emotion and sentiment. Metallinou et al. [9] fused the audio and text modalities for emotion recognition. Both approaches relied on feature-level fusion.

In this paper, we study the behavior of the method proposed in [13] in aspects rarely addressed by other authors, such as speaker independence, generalizability of the models, and the performance of the individual modalities.

III. UNIMODAL FEATURE EXTRACTION
For unimodal feature extraction, we follow the procedure of bc-LSTM [13].
A. Textual Feature Extraction
We employ a convolutional neural network (CNN) for textual feature extraction. Following [8], we obtain n-gram features from each utterance using three distinct convolution filters of sizes 3, 4, and 5, each having 50 feature maps. The outputs are then subjected to max-pooling followed by rectified linear unit (ReLU) activation. These activations are concatenated and fed to a dense layer, whose output is regarded as the textual utterance representation. This network is trained at the utterance level with the emotion labels.
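A minimal sketch of such a text CNN is given below (PyTorch); the vocabulary size, embedding dimension, and dense-layer width are illustrative assumptions rather than values reported here.

# Sketch of the utterance-level text CNN described above (Kim [8] style).
# vocab_size, embed_dim and dense_dim are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    def __init__(self, vocab_size=5000, embed_dim=300, dense_dim=100, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # Three convolution filters of sizes 3, 4 and 5, each with 50 feature maps.
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, 50, kernel_size=k) for k in (3, 4, 5)]
        )
        self.dense = nn.Linear(3 * 50, dense_dim)      # textual utterance representation
        self.out = nn.Linear(dense_dim, num_classes)   # trained with utterance labels

    def forward(self, tokens):                          # tokens: (batch, seq_len)
        x = self.embedding(tokens).transpose(1, 2)      # (batch, embed_dim, seq_len)
        # Max-pool over time for each filter size, apply ReLU, then concatenate.
        pooled = [F.relu(conv(x).max(dim=2).values) for conv in self.convs]
        features = self.dense(torch.cat(pooled, dim=1))
        return self.out(features), features

# Usage: logits, feats = TextCNN()(torch.randint(0, 5000, (8, 40)))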
B. Audio and Visual Feature Extraction
As in [13], we use a 3D-CNN and openSMILE [7] for visual and acoustic feature extraction, respectively.
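For the acoustic features, openSMILE is typically invoked as a command-line tool. The following sketch assumes an installed SMILExtract binary; the configuration file path and output format depend on the openSMILE version and are not specified in this paper.

# Sketch of calling the openSMILE extractor per utterance; the config path
# below is a placeholder assumption, not a value taken from the paper.
import subprocess

def extract_audio_features(wav_path, out_path, config="config/IS13_ComParE.conf"):
    # SMILExtract reads the utterance audio and writes one feature vector.
    subprocess.run(
        ["SMILExtract", "-C", config, "-I", wav_path, "-O", out_path],
        check=True,
    )

# extract_audio_features("utterance_01.wav", "utterance_01_features.arff")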
C. Fusion
In order to fuse the information extracted from the different modalities, we concatenated the feature vectors representative of the given modalities and sent the combined vector to a classifier for classification. This scheme of fusion is called feature-level fusion. Since the fusion involved only concatenation, with no overlap, merging, or combination, scaling and normalization of the features were avoided. We discuss the results of this fusion in Section IV.
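A minimal sketch of this concatenation-based (feature-level) fusion is shown below; the feature dimensionalities are placeholders, not values from the paper.

# Feature-level fusion: per-modality utterance vectors are simply
# concatenated, with no scaling or normalization, before classification.
import numpy as np

def fuse(text_feats, audio_feats, video_feats):
    """Each argument is an (n_utterances, dim_modality) array."""
    return np.concatenate([text_feats, audio_feats, video_feats], axis=1)

# Example with assumed dimensionalities:
fused = fuse(np.zeros((448, 100)), np.zeros((448, 384)), np.zeros((448, 128)))
print(fused.shape)  # (448, 612)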
D. Baseline Methods

1) bc-LSTM:
We follow the bc-LSTM method [13], where a bidirectional LSTM is used to capture the context from the surrounding utterances and generate context-aware utterance representations.
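The sketch below illustrates the idea with a bidirectional LSTM running over the sequence of utterance feature vectors of one video; the layer sizes are assumptions, not the configuration used in [13].

# Rough sketch of the bc-LSTM idea: every utterance representation is
# conditioned on its surrounding utterances within the same video.
import torch
import torch.nn as nn

class ContextLSTM(nn.Module):
    def __init__(self, feat_dim=612, hidden_dim=300, num_classes=2):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, utterances):                   # (batch, n_utterances, feat_dim)
        context_aware, _ = self.lstm(utterances)     # (batch, n_utterances, 2*hidden)
        return self.classifier(context_aware)        # per-utterance predictions

# Usage: logits = ContextLSTM()(torch.randn(4, 30, 612))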
2) SVM:
After extracting the features, we merged them and sent the combined vector to an SVM with an RBF kernel for the final classification.
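A minimal sketch of this SVM baseline, assuming scikit-learn and dummy feature arrays, is given below; C and gamma are left at their defaults since they are not reported here.

# RBF-kernel SVM over the fused utterance features (dummy data).
import numpy as np
from sklearn.svm import SVC

X_train, y_train = np.random.randn(1447, 612), np.random.randint(0, 2, 1447)
X_test = np.random.randn(752, 612)

clf = SVC(kernel="rbf")          # RBF kernel, as stated above
clf.fit(X_train, y_train)        # train on fused utterance features
pred = clf.predict(X_test)       # predict sentiment / emotion labels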
IV. EXPERIMENTS AND OBSERVATIONS
In this section, we discuss the datasets and the experimental settings. We also analyze the results yielded by the aforementioned methods.
A. Datasets

1) Multimodal Sentiment Analysis Datasets:
For our experiments, we used the MOUD dataset, developed by Pérez-Rosas et al. [10]. They collected 80 product review and recommendation videos from YouTube. Each video was segmented into its utterances (498 in total), and each utterance was categorized by a sentiment label (positive, negative, or neutral). On average, each video has 6 utterances, and each utterance is 5 seconds long. In our experiments, we did not consider neutral labels, which led to a final dataset of 448 utterances. We dropped the neutral label to maintain consistency with previous work. In a similar fashion, Zadeh et al. [16] constructed a multimodal sentiment analysis dataset called multimodal opinion-level sentiment intensity (MOSI), which is bigger than MOUD, consisting of 2199 opinionated utterances from 93 videos by 89 speakers. The videos address a large array of topics, such as movies, books, and products. In the experiment addressing the generalizability issue, we trained a model on MOSI and tested it on MOUD. Table I shows the train/test split of these datasets.
2) Multimodal Emotion Recognition Dataset:
The IEMOCAP database [1] was collected for the purpose of studying multimodal expressive dyadic interactions. It contains 12 hours of video data, split into dyadic interaction sessions of approximately 5 minutes each between professional male and female actors. Each interaction session was split into spoken utterances. At least 3 annotators assigned to each utterance one emotion category: happy, sad, neutral, angry, surprised, excited, frustrated, disgusted, fearful, or other. In this work, we considered only the utterances with majority agreement (i.e., at least two out of three annotators labeled the same emotion) in the emotion classes angry, happy, sad, and neutral. Table I shows the train/test split of this dataset.
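The following is an illustrative sketch (not the authors' code) of this majority-agreement filtering:

# Keep an utterance only if at least two of its (three) annotators chose the
# same label and that label is one of the four target emotion classes.
from collections import Counter

TARGET = {"angry", "happy", "sad", "neutral"}

def majority_label(annotations):
    """annotations: list of labels given by the annotators of one utterance."""
    label, count = Counter(annotations).most_common(1)[0]
    if count >= 2 and label in TARGET:
        return label
    return None  # dropped: no majority, or outside the four classes

print(majority_label(["happy", "happy", "excited"]))  # 'happy'
print(majority_label(["angry", "sad", "neutral"]))     # None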
Dataset        Train (utterances / videos)   Test (utterances / videos)
IEMOCAP        4290 / 120                    1208 / 31
MOSI           1447 / 62                     752 / 31
MOUD           322 / 59                      115 / 20
MOSI → MOUD    2199 / 93                     437 / 79

TABLE I: Person-independent train/test split details of each dataset (X → Y represents train on X and test on Y). Validation sets are extracted from the shuffled train sets using an 80/20% train/val ratio.

B. Speaker-Exclusive Experiment

Most of the research on multimodal sentiment analysis is performed on datasets with common speaker(s) between the train and test splits. However, given this overlap, results do not scale to true generalization: in real-world applications, the model should be robust to speaker variance. Thus, we performed speaker-exclusive experiments to emulate unseen conditions. This time, the train/test splits of the datasets were completely disjoint with respect to speakers. While testing, our models had to classify emotions and sentiments from utterances by speakers they had never seen before. Below, we elaborate on this speaker-exclusive experiment; a sketch of such a split is given after the list.
• IEMOCAP: As this dataset contains 10 speakers, we performed a 10-fold speaker-exclusive test, where in each round exactly one speaker was included in the test set and excluded from the train set. The same SVM model was used as before, and accuracy was used as the performance metric.
• MOUD: This dataset contains videos of about 80 people reviewing various products in Spanish. Each utterance in a video is labeled as positive, negative, or neutral. In our experiments, we considered only the samples with positive and negative sentiment labels. The speakers were partitioned into 5 groups, and a 5-fold person-exclusive experiment was performed, where in every fold one of the five groups was in the test set. Finally, we averaged the accuracies to summarize the results (Table II).
• MOSI: The MOSI dataset is rich in sentimental expressions; 93 people review various products in English. The videos are segmented into clips, and each clip is assigned a sentiment score between −3 and +3 by five annotators. We took the average of these scores as the sentiment polarity and thus considered two classes (positive and negative). As with MOUD, the speakers were divided into five groups, and a 5-fold person-exclusive experiment was run. For each fold, on average 75 people were in the training set and the remaining in the test set. The training set was further shuffled and partitioned into an 80%–20% split to generate the train and validation sets for parameter tuning.
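The speaker-exclusive splits above can be emulated, for example, with scikit-learn's GroupKFold using the speaker identity as the group; the sketch below uses dummy arrays and an assumed feature dimensionality.

# Speaker-exclusive (person-independent) cross-validation: with the speaker
# id as the group, no speaker appears in both the train and the test fold.
import numpy as np
from sklearn.model_selection import GroupKFold
from sklearn.svm import SVC

X = np.random.randn(498, 612)               # fused utterance features (dummy)
y = np.random.randint(0, 2, 498)             # sentiment labels (dummy)
speakers = np.random.randint(0, 10, 498)     # speaker id per utterance (dummy)

accs = []
for train_idx, test_idx in GroupKFold(n_splits=5).split(X, y, groups=speakers):
    clf = SVC(kernel="rbf").fit(X[train_idx], y[train_idx])
    accs.append(clf.score(X[test_idx], y[test_idx]))
print(np.mean(accs))   # average accuracy over the person-exclusive folds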
1) Speaker-Inclusive vs. Speaker-Exclusive:
In comparison with the speaker-inclusive experiments, the speaker-exclusive setting yielded inferior results, owing to the absence of knowledge about the speakers during the testing phase. Table II shows the performance obtained in the speaker-inclusive and speaker-exclusive experiments. It can be seen that the audio modality consistently performs better than the visual modality on both the MOSI and IEMOCAP datasets. The text modality plays the most important role in both emotion recognition and sentiment analysis. The fusion of modalities shows more impact for emotion recognition than for sentiment analysis. The root mean square error (RMSE) and TP-rate of the experiments using different modalities on the IEMOCAP and MOSI datasets are shown in Fig. 1.

Fig. 1: Experiments on the IEMOCAP and MOSI datasets. The top-left panel shows the RMSE of the models on IEMOCAP and MOSI. The top-right panel shows the dataset distribution. The bottom-left and bottom-right panels present the TP-rate of the models on the IEMOCAP and MOSI datasets, respectively.
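The two reported metrics can be computed, for instance, as follows (a sketch assuming scikit-learn; the arrays are dummy placeholders):

# RMSE over the predictions and per-class TP-rate (recall).
import numpy as np
from sklearn.metrics import mean_squared_error, recall_score

y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1])

rmse = np.sqrt(mean_squared_error(y_true, y_pred))
tp_rate = recall_score(y_true, y_pred, average=None)  # one TP-rate per class
print(rmse, tp_rate)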
C. Contributions of the Modalities
As expected, the bimodal and trimodal models performed better than the unimodal models in all experiments. Overall, the audio modality performed better than the visual modality on all datasets. Except for the MOUD dataset, the unimodal performance of the text modality is substantially better than that of the other two modalities (Fig. 2).
D. Generalizability of the Models
To test the generalization ability of the models, we trained the framework on the MOSI dataset in a speaker-exclusive fashion and tested it on the MOUD dataset. From Table III, we can see that the model trained on the MOSI dataset performed poorly on the MOUD dataset. This is mainly due to the fact that the reviews in the MOUD dataset were recorded in Spanish, so both the audio and text modalities fail badly in recognition, as the MOSI dataset contains reviews in English.
Modality       IEMOCAP           MOUD              MOSI
Combination    Sp-In    Sp-Ex    Sp-In    Sp-Ex    Sp-In    Sp-Ex
A              66.20    51.52    –        53.70    64.00    57.14
V              60.30    41.79    –        47.68    62.11    58.46
T              67.90    65.13    –        48.40    78.00    75.16
T + A          78.20    70.79    –        57.10    76.60    75.72
T + V          76.30    68.55    –        49.22    78.80    75.06
A + V          73.90    52.15    –        62.88    66.65    62.4
T + A + V      –

TABLE II: Accuracy reported for the speaker-exclusive (Sp-Ex) and speaker-inclusive (Sp-In) splits for concatenation-based fusion.
Fig. 2: Performance of the modalities on the datasets (panels: IEMOCAP, MOUD, MOSI). The red line indicates the median of the accuracy. A stands for Audio, V for Video, T for Text.
Modality Combination    Accuracy
T                       46.5%
V                       43.3%
A                       42.9%
T + A                   50.4%
T + V                   49.8%
A + V                   46.0%
T + A + V               51.1%

TABLE III: Cross-dataset results: the model (with the previous configurations) trained on the MOSI dataset and tested on the MOUD dataset.

A more comprehensive study would be to perform generalizability tests on datasets of the same language; however, we were unable to do this due to the lack of benchmark datasets. Similar cross-dataset generalization experiments were also not performed for emotion detection, given the availability of only a single dataset (IEMOCAP).
E. Comparison among the Baseline Methods
Table IV consolidates and compares the performance of all the baseline methods on all the datasets. We evaluated SVM and bc-LSTM fusion on the MOSI, MOUD, and IEMOCAP datasets. From Table IV, it is clear that bc-LSTM performs better than SVM across all the experiments, so it is apparent that considering context in the classification process substantially boosts the performance.
F. Visualization of the Datasets
The MOSI visualizations present information regarding the dataset distribution within single and multiple modalities (Fig. 3). For the textual and audio modalities, dense clustering with substantial overlap between the classes can be seen. This problem is lessened in the video and all-modalities scenarios through a more structured spreading of the clusters, but the overlap is reduced only in the multimodal case. This forms an intuitive explanation of the improved performance in the multimodal scenario. The IEMOCAP visualizations provide insight into the 4-class distribution for the unimodal and multimodal scenarios, where the multimodal distribution clearly has the least overlap (with the red and blue classes, in particular, separating from the rest), and the sparser distribution aids the classification process.
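A small sketch of how such 2D t-SNE projections can be produced for one feature set is given below, assuming scikit-learn and matplotlib with dummy data:

# 2D t-SNE projection of utterance features, colored by class (cf. Fig. 3).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

features = np.random.randn(448, 612)      # e.g., fused multimodal features (dummy)
labels = np.random.randint(0, 2, 448)     # e.g., negative / positive (dummy)

points = TSNE(n_components=2, random_state=0).fit_transform(features)
for cls, name in enumerate(["Negative", "Positive"]):
    mask = labels == cls
    plt.scatter(points[mask, 0], points[mask, 1], s=5, label=name)
plt.legend()
plt.title("All modalities")
plt.show()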
V. CONCLUSION
We have presented useful baselines for multimodal sentiment analysis and multimodal emotion recognition. We also discussed some major aspects of the multimodal sentiment analysis problem, such as the performance in the unknown-speaker setting and the cross-dataset performance of the models.

Our future work will focus on extracting semantics from the visual features, the relatedness of the cross-modal features, and their fusion. We will also include contextual dependency learning in our model to overcome the limitations mentioned in the previous section.
Modality       IEMOCAP            MOUD               MOSI
Combination    SVM     bc-LSTM    SVM     bc-LSTM    SVM     bc-LSTM
A              52.9
V              47.0
T              65.5
T + A          70.1
T + V          68.5

TABLE IV: Accuracy reported for speaker-exclusive classification. A represents Audio, V represents Video, T represents Text.
Fig. 3: t-SNE 2D visualization of the MOSI and IEMOCAP datasets when unimodal features and multimodal features are used (panels: Text, Audio, Video, All; MOSI classes: Negative, Positive; IEMOCAP classes: Happy, Sad, Neutral, Anger).
REFERENCES

[1] C. Busso, M. Bulut, C.-C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. N. Chang, S. Lee, and S. S. Narayanan. IEMOCAP: Interactive emotional dyadic motion capture database. Language Resources and Evaluation, 42(4):335–359, 2008.
[2] E. Cambria, H. Wang, and B. White. Guest editorial: Big social data analysis. Knowledge-Based Systems, 69:1–2, 2014.
[3] L. S. Chen, T. S. Huang, T. Miyasato, and R. Nakatsu. Multimodal human emotion/expression recognition. In Proceedings of the Third IEEE International Conference on Automatic Face and Gesture Recognition, pages 366–371. IEEE, 1998.
[4] D. Datcu and L. Rothkrantz. Semantic audio-visual data fusion for automatic emotion recognition. Euromedia, 2008.
[5] L. C. De Silva, T. Miyasato, and R. Nakatsu. Facial emotion recognition using multi-modal information. In Proceedings of ICICS, volume 1, pages 397–401. IEEE, 1997.
[6] P. Ekman. Universal facial expressions of emotion. Culture and Personality: Contemporary Readings/Chicago, 1974.
[7] F. Eyben, M. Wöllmer, and B. Schuller. openSMILE: The Munich versatile and fast open-source audio feature extractor. In Proceedings of the 18th ACM International Conference on Multimedia, pages 1459–1462. ACM, 2010.
[8] Y. Kim. Convolutional neural networks for sentence classification. CoRR, abs/1408.5882, 2014.
[9] A. Metallinou, S. Lee, and S. Narayanan. Audio-visual emotion recognition using Gaussian mixture models for face and voice. In Tenth IEEE International Symposium on ISM 2008, pages 250–257. IEEE, 2008.
[10] V. Pérez-Rosas, R. Mihalcea, and L.-P. Morency. Utterance-level multimodal sentiment analysis. In ACL, pages 973–982, 2013.
[11] S. Poria, E. Cambria, R. Bajpai, and A. Hussain. A review of affective computing: From unimodal analysis to multimodal fusion. Information Fusion, 37:98–125, 2017.
[12] S. Poria, E. Cambria, and A. Gelbukh. Deep convolutional neural network textual features and multiple kernel learning for utterance-level multimodal sentiment analysis. In Proceedings of EMNLP, pages 2539–2544, 2015.
[13] S. Poria, E. Cambria, D. Hazarika, N. Majumder, A. Zadeh, and L.-P. Morency. Context-dependent sentiment analysis in user-generated videos. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 873–883, Vancouver, Canada, July 2017. Association for Computational Linguistics.
[14] S. Poria, I. Chaturvedi, E. Cambria, and A. Hussain. Convolutional MKL based multimodal emotion recognition and sentiment analysis. In ICDM, pages 439–448, Barcelona, 2016.
[15] M. Wöllmer, F. Weninger, T. Knaup, B. Schuller, C. Sun, K. Sagae, and L.-P. Morency. YouTube movie reviews: Sentiment analysis in an audio-visual context. IEEE Intelligent Systems, 28(3):46–53, 2013.
[16] A. Zadeh, R. Zellers, E. Pincus, and L.-P. Morency. Multimodal sentiment intensity analysis in videos: Facial gestures and verbal messages.