A Deep Multi-Level Attentive network for Multimodal Sentiment Analysis
Ashima Yadav, Dinesh Kumar Vishwakarma
Biometric Research Laboratory, Department of Information Technology, Delhi Technological University, Delhi, India. E-mail: [email protected], [email protected]

Abstract — Multimodal sentiment analysis has attracted increasing attention with broad application prospects. Existing methods focus on a single modality, which fails to capture social media content that spans multiple modalities. Moreover, in multimodal learning, most works simply combine the two modalities without exploring the complicated correlations between them. This results in unsatisfactory performance for multimodal sentiment classification. Motivated by the status quo, we propose a Deep Multi-Level Attentive network (DMLANet), which exploits the correlation between image and text modalities to improve multimodal learning. Specifically, we generate the bi-attentive visual map along the spatial and channel dimensions to magnify the CNN’s representation power. Then we model the correlation between the image regions and the semantics of the words by applying semantic attention to extract the textual features related to the bi-attentive visual features. Finally, self-attention is employed to automatically fetch the sentiment-rich multimodal features for classification. We conduct extensive evaluations on four real-world datasets, namely MVSA-Single, MVSA-Multiple, Flickr, and Getty Images, which verify the superiority of our method.
Index Terms — Attention, Deep learning, Multimedia, Multimodal recognition, Sentiment analysis, Social networking.
1 INTRODUCTION
The rapid popularity of social media has contributed to an enormous amount of multimedia data containing visual content and textual descriptions. People continuously express their emotions and opinions on social networking sites like Twitter, Flickr, etc. [1]. Sentiment analysis [2] of multimodal data is crucial for understanding the attitude and behavior of people in many real-world applications like healthcare [3], politics [4], cinematography [5], and business analysis [6]. Hence, automatically detecting sentiments from visual and textual content has emerged as a significant research problem.

Previous multimodal sentiment analysis works have concentrated on early fusion-based techniques that combine the different features from multiple sources and feed them into a classifier [7]. Others predict the sentiment from each source separately and then aggregate the results to get the multimodal sentiment label [8]; this is known as late fusion. The major drawback of these works is that they fail to capture the complex correlation between the modalities. Similarly, some works have employed intermediate fusion, which combines the modalities in the intermediate layers of the network [9]. However, this requires a careful design approach and may not perform well if a portion of the multimodal content is incomplete.

Although significant efforts have been made in this area, predicting multimodal sentiment remains an open problem for the following reasons. Firstly, each modality has its individual characteristics and is expressed differently by the human cognitive system [10]. Hence, it is not easy to deal with such heterogeneous content in multimodal analysis, and the correlations between the image and text descriptions need to be captured effectively to bridge this gap. Secondly, some of the existing works have focused on the attention mechanism to capture the crucial image regions and sentimental words for generating efficient multimodal features. However, these approaches only focus on region-based attention and do not utilize channel information for developing visual features. This is important because channel-based attention helps to identify the crucial patterns in the given image. Thirdly, most works fail to utilize image-level features to highlight the sentimental words, which could enhance the performance of multimodal sentiment classification.

Hence, to tackle the above challenges, we propose a Deep Multi-level Attentive network (DMLANet), which introduces channel-level and region-level attention to generate a bi-attentive visual feature map that enhances the representation power of the CNN. After that, we utilize semantic attention to extract the attended textual features corresponding to the bi-attentive visual features. Next, we concatenate the attended text features with the bi-attentive visual features to obtain the multimodal features, which are passed into the self-attention network. This network weighs all the multimodal features and extracts only the significant sentiment-rich features for classification. Thus, our model exploits the fine-grained correlations between the image and text descriptions, leading to effective classification results. The significant contributions can be summarized as follows:

A Deep Multi-level Attentive network (DMLANet) is developed, which generates discriminative features from the visual and textual descriptions by introducing attention at multiple levels. We obtain the bi-attentive visual features by exploiting channel attention and spatial attention in the visual data.

To capture the complicated correlation between the images and text, we develop a joint multimodal learning strategy that focuses on the crucial text-based features based on the attended visual features, followed by a self-attention unit to extract the sentiment-rich multimodal features.

To show our method’s effectiveness, we conduct experiments on four real-world datasets: 1) strongly labelled: the MVSA-Single and MVSA-Multiple datasets; 2) weakly labelled: the Flickr and Getty Images datasets. The performance of our model is validated in terms of Accuracy, Precision, Recall, F1 score, ROC curves, and PRC curves, which confirm the superiority of our model.

Finally, ablation experiments and visualizations of attention maps are carried out to analyze the impact of the attention mechanisms on the visual and textual data samples for multimodal sentiment classification.

The rest of the manuscript is organized as follows: Section 2 discusses the related work in multimodal sentiment analysis and multimodal fusion. Section 3 elaborates on our proposed network. Section 4 presents the experimental results, and Section 5 concludes our work and outlines future work.
2 RELATED WORK
This section discusses the work related to the field of multimodal sentiment analysis and multimodal fusion.
With the rapid popularity of smartphone devices, an enormous amount of data is generated on social media. This makes sentiment analysis on multiple modalities a popular field of research. Early works focused mainly on feature-selection-based approaches. Baecchi et al. [11] applied the continuous bag-of-words (CBOW) model to extract textual information and a denoising autoencoder to obtain robust visual features from short Twitter messages. Fang et al. [12] proposed a probabilistic graphical model to capture the correlation between the textual and visual data of Flickr. Ji et al. [10] proposed a hypergraph learning framework which computes the relevance among the textual, visual, and emoticon modalities on Sina Weibo microblogs. Dai et al. [13] constructed a structured forest to generate a bag of affective words, which reduces the gap between the low-level features and affective descriptors on the Multilingual Visual Sentiment Ontology dataset.

Due to the powerful performance of deep learning-based approaches [14], these techniques are increasingly being applied to multimodal sentiment analysis. Xu et al. [15] applied word-level and sentence-level attention for modeling the textual data and a CNN-LSTM approach for extracting the semantic information in images. Chen et al. [16] used emoticons as weak labels and learned joint features from the image and textual modalities using a CNN and a dynamic CNN. A probabilistic graphical model was applied to infer the correlation among the predicted labels of the various modalities. Zhao et al. [17] experimented with five pre-trained CNN models for extracting the features from images, and word2vec was applied for textual feature extraction. Cosine similarity was applied to quantify the consistency between the features from both modalities. Finally, the features were merged for classification. Yu et al. [18] proposed a network for entity-level multimodal sentiment classification. They extracted and represented the target entity using an LSTM network and then captured the contextual information using the attention mechanism. Bilinear pooling was utilized to capture the interactions among the different modalities.
Multimodal fusion [19] integrates the features from multiple data sources to predict the final class value. There are mainly three types of fusion strategies: early fusion, late fusion, and intermediate fusion.

Early fusion combines the input features of multiple modalities to obtain a single feature vector. Poria et al. [7] extracted the visual and textual multimodal features through deep networks and fused them using a multiple kernel classifier for sentiment classification. However, early fusion cannot capture the time-synchronicity of different modalities and often results in a high-dimensional, redundant feature vector.

Late fusion refers to a combination of results from multiple classifiers, where each classifier is trained on a separate modality. However, late fusion ignores the low-level interaction of the modalities. Xu et al. [8] developed a bi-directional attention model, which exploits the correlation between the visual and textual contents simultaneously to fuse the attended visual-textual features via late fusion. In Xu et al. [20], the cross-modal relation among the images, text, and social links was explored through multi-level LSTMs. A joint relationship was obtained to learn the inter-modal correlations at different levels.

In deep learning models, intermediate fusion is commonly employed as a deep multimodal fusion strategy, where the input data is transformed into a higher-level representation through multiple layers. Huang et al. [9] proposed multimodal attentive fusion, which focuses separately on the visual attention model and the semantic attention model, followed by intermediate-fusion-based multimodal attention. In intermediate fusion, the different representations are fused at different depths, which can lead to overfitting, in which case the network fails to model the relationship between the modalities. Hence, a careful design approach needs to be followed.
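To make the contrast between early and late fusion concrete, the following is a minimal sketch in plain Python. It is not taken from any of the cited works: the linear classifiers, the feature values, and the averaging rule used for late fusion are all illustrative assumptions.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def linear(weights, features):
    """Apply a weight matrix (list of rows) to a feature vector."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]

def early_fusion(image_feat, text_feat, joint_weights):
    """Concatenate the modality features first, then classify once."""
    fused = image_feat + text_feat          # list concatenation
    return softmax(linear(joint_weights, fused))

def late_fusion(image_feat, text_feat, image_weights, text_weights):
    """Classify each modality separately, then average the predictions."""
    p_image = softmax(linear(image_weights, image_feat))
    p_text = softmax(linear(text_weights, text_feat))
    return [(a + b) / 2 for a, b in zip(p_image, p_text)]

# Toy features and weights (hypothetical values, two sentiment classes).
image_feat = [0.2, 0.9]
text_feat = [0.7, 0.1, 0.4]
image_weights = [[1.0, -1.0], [-1.0, 1.0]]
text_weights = [[0.5, 0.0, 0.5], [-0.5, 0.0, -0.5]]
joint_weights = [iw + tw for iw, tw in zip(image_weights, text_weights)]

p_early = early_fusion(image_feat, text_feat, joint_weights)
p_late = late_fusion(image_feat, text_feat, image_weights, text_weights)
```

In the early variant a single classifier sees both modalities at once, whereas in the late variant the modalities only meet at the level of predicted probabilities, which is why low-level cross-modal interactions are not modeled.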
Although the existing works for multimodal sentiment analysis have shown significant improvements, several factors need to be incorporated for effective results. Firstly, current works fail to extract important sentiment words corresponding to the image features. Secondly, they do not utilize the channel dimension to generate robust visual features, which enhances the crucial channels in the given image and boosts the model’s overall performance. Thus, we aim to develop a Deep Multi-level Attentive network (DMLANet), which tackles the above challenges.
3 PROPOSED METHODOLOGY
This section presents the details of the proposed Deep Multi-level Attentive network (DMLANet). In Section 3.1, we give an outline of the proposed network. Then we present the visual attention module in Section 3.2, which generates significant bi-attentive visual features by utilizing channel attention and spatial attention. Finally, Section 3.3 discusses a joint attended multimodal learning process that learns a combined representation for textual and visual features by applying semantic attention, which measures the semantic closeness of text and visual features, followed by a self-attention mechanism which extracts the crucial multimodal features for sentiment classification.
Let $D$ represent the given set of documents. For each document $d \in D$, let $I = \{I_1, I_2, \ldots, I_n\}$ denote the set of images for the visual component of the document and $T = \{T_1, T_2, \ldots, T_n\}$ denote the set of text descriptions, i.e., the sequence of sentences for the textual component of the document. Each sentence $T_i$ is composed of a sequence of words $w_i$, $i \in [1, S]$. Each document is further associated with one of the following sentiment labels: positive, negative, and neutral (the Flickr and Getty datasets are labeled with positive and negative sentiments only). Thus, the objective is to predict the sentiment labels of unseen documents by training the network on the training corpus.

Fig. 1 shows the block diagram of the proposed framework. In the visual attention module, we employ channel-based attention, which enhances the information-rich channels, and spatial or region-based attention, which further concentrates on the emotional regions based on the attended channels to get the bi-attentive visual feature map. In joint attended multimodal learning, semantic attention is applied to measure the emotional words related to the bi-attentive visual features. Next, we combine the attended word features and the bi-attentive visual features and pass them to the self-attention block, which automatically highlights the important multimodal features. These features are then passed to the classifier for sentiment classification.

Recently, attention networks have shown significant performance in many computer vision tasks [21]. They make the CNN learn and focus on the crucial information by suppressing unnecessary information, thus improving the overall classification performance. We achieve this by sequentially generating the bi-attentive map along the spatial and channel dimensions separately to magnify the CNN’s representation power. This approach has been popularly used for the task of object detection [22]. However, in multimodal sentiment classification, most of the previous works have ignored the channel dimension when obtaining the visual features. This is important because channel-based attention concentrates on the information-rich channels, i.e., it highlights ‘what’ the crucial elements in the given image are. Sections 3.2.1 and 3.2.2 and Fig. 2 describe the entire visual attention module.
For each image $I_i$, we obtain the feature map $M \in \mathbb{R}^{H \times W \times C}$ using the Inception V3 [23] network. In channel attention, we apply global average pooling and global max pooling to the feature map to generate the average-pooled and max-pooled features, respectively. Each pooled feature is passed to a multilayer perceptron with one hidden layer, the two outputs are merged by element-wise summation, and the ReLU activation function is applied to get the final attention map $A_c \in \mathbb{R}^{1 \times 1 \times 256}$. We apply ReLU rather than tanh because it converges quickly and is cheaper to compute. The channel attention process is summarized in Eq. (1):

$$A_c = \mathrm{ReLU}\big[W_1(W_0(MP(M))) + W_1(W_0(GAP(M)))\big] \tag{1}$$

where $W_0$ and $W_1$ are the weights of the multilayer perceptron, $M$ denotes the feature map, $MP$ is the max-pooling layer, $GAP$ is the global average pooling layer, and $A_c$ is the channel attention. Hence, channel attention extracts ‘what’ the meaningful features in a given image are by squeezing the spatial dimensions of the feature map using the average and max-pooling layers and merging the output vectors using element-wise summation.

Fig. 1. Block diagram of the proposed DMLANet. (Visual Attention Module: convolution layers, channel attention, spatial attention, bi-attentive visual feature map; Joint Attended Multimodal Learning: LSTM, semantic attention, concatenation, self-attention, dense layer, output.)
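The channel attention step of Eq. (1) can be sketched in plain Python as follows. This is an illustrative sketch, not the authors' Keras implementation: the names `W0` and `W1` for the two MLP weight matrices, the ReLU inside the hidden layer, the toy feature map, and all weight values are assumptions.

```python
def relu(x):
    return max(0.0, x)

def dense(weights, vector):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]

def shared_mlp(vector, w0, w1):
    """One hidden layer (ReLU assumed), shared by both pooled descriptors."""
    hidden = [relu(h) for h in dense(w0, vector)]
    return dense(w1, hidden)

def channel_attention(feature_map, w0, w1):
    """Eq. (1): A_c = ReLU(W1(W0(MP(M))) + W1(W0(GAP(M)))).

    feature_map is a nested list indexed as [row][col][channel].
    """
    pixels = [pixel for row in feature_map for pixel in row]
    channels = len(pixels[0])
    gap = [sum(p[c] for p in pixels) / len(pixels) for c in range(channels)]
    gmp = [max(p[c] for p in pixels) for c in range(channels)]
    summed = [a + b for a, b in zip(shared_mlp(gmp, w0, w1),
                                    shared_mlp(gap, w0, w1))]
    return [relu(s) for s in summed]

def refine(feature_map, attention):
    """Channel-refined map F: scale every channel by its attention weight."""
    return [[[v * a for v, a in zip(pixel, attention)] for pixel in row]
            for row in feature_map]

# Toy 2 x 2 feature map with 3 channels and hypothetical MLP weights.
M = [[[1.0, 0.0, 2.0], [3.0, 1.0, 0.0]],
     [[0.0, 2.0, 1.0], [2.0, 1.0, 1.0]]]
W0 = [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5]]      # 3 channels -> 2 hidden units
W1 = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]    # 2 hidden units -> 3 channels
A_c = channel_attention(M, W0, W1)
F = refine(M, A_c)
```

The pooling collapses the spatial axes, so `A_c` holds one weight per channel; `refine` then broadcasts these weights back over every spatial position to produce the channel-refined map used by the spatial attention step.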
The spatial attention map tells ‘where’ the informative part of the image is, i.e., it locates the relevant image regions according to the attended channel-based features. The input feature map $M$ is element-wise multiplied with the channel attention map $A_c$ to generate the channel-refined feature map $F$. The channel-refined feature map is average-pooled and max-pooled, and the two pooled maps are concatenated and fed into a convolutional layer with a 7 × 7 kernel to generate the spatially attended features. We refer to these as the bi-attentive visual features, as they combine the channel-attended and spatial-attended visual features. The conversion of the channel-refined feature map $F$ into the spatially attended features $A_s$ is summarized in Eq. (2):

$$A_s = \mathrm{ReLU}\big[\mathrm{Conv2D}(AP(F); MP(F))\big] \tag{2}$$

Finally, we obtain the following sequence of bi-attentive visual features, as shown in Eq. (3):

$$v_f = \{v_1, v_2, \ldots, v_n\}, \quad v_f \in \mathbb{R}^{m \times d} \tag{3}$$

where $m$ is the number of regions and $d$ is the feature dimension of each region.

Each word $w_i$ is transformed into a real-valued vector through a pre-trained embedding matrix (we use GloVe embeddings, https://nlp.stanford.edu/projects/glove/) and then passed through an LSTM (Long Short-Term Memory) network, which gives the high-level textual features $t_f$. Existing work fails to detect the sentimental words that are related to the images. We address this problem by exploiting the correlation between the words and the visual features generated in Section 3.2. We measure the semantic closeness of the textual features to the visual content by combining both features using element-wise multiplication and generating the joint features $m_f$, as shown in Eq. (4):

$$m_f = \tanh\big(W(v_f \odot t_f)\big) \tag{4}$$

where $W$ denotes the learnable weights.
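The spatial attention step of Eq. (2) can be illustrated with the following plain-Python sketch. It is an assumption-laden illustration rather than the authors' code: zero padding that preserves the spatial size, a single output filter, and the uniform kernel values are all choices made here for the example.

```python
def spatial_attention(refined_map, kernel):
    """Eq. (2): A_s = ReLU(Conv2D([AP(F); MP(F)])).

    refined_map is indexed [row][col][channel]; kernel is indexed
    [ky][kx][2], where the last axis holds the weights for the
    average-pooled and max-pooled planes.
    """
    height, width = len(refined_map), len(refined_map[0])
    size = len(kernel)
    pad = size // 2
    # Pool across the channel axis to get two H x W planes.
    pooled = [[(sum(px) / len(px), max(px)) for px in row]
              for row in refined_map]
    attention = []
    for y in range(height):
        out_row = []
        for x in range(width):
            total = 0.0
            for ky in range(size):
                for kx in range(size):
                    yy, xx = y + ky - pad, x + kx - pad
                    if 0 <= yy < height and 0 <= xx < width:  # zero padding
                        avg, mx = pooled[yy][xx]
                        total += (kernel[ky][kx][0] * avg
                                  + kernel[ky][kx][1] * mx)
            out_row.append(max(0.0, total))                   # ReLU
        attention.append(out_row)
    return attention

# Toy 3 x 3 refined map with 2 channels; uniform 7 x 7 kernel (hypothetical).
F = [[[float(r + c), float(r * c)] for c in range(3)] for r in range(3)]
kernel = [[[0.01, 0.01] for _ in range(7)] for _ in range(7)]
A_s = spatial_attention(F, kernel)
```

Where channel attention pools over the spatial axes, this step pools over the channel axis, so the output has one score per image location. In the toy example the 7 × 7 window covers the whole 3 × 3 map at every position, so all scores coincide; on a realistic feature map they differ by region.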
The attention scores are computed as shown in Eq. (5):

$$\alpha_f = \frac{\exp(m_f)}{\sum_f \exp(m_f)} \tag{5}$$

Finally, we obtain the attended word-level features, which measure the emotional textual features related to the visual features, as follows:

$$s_f = \sum_f \alpha_f * t_f \tag{6}$$

Next, we concatenate the obtained attended textual features $s_f$ with the visual features to obtain the joint multimodal features $J_f = (s_f, v_f)$. In multimodal learning, not all the modalities contribute equally to the classification task [24]. Hence, we apply a self-attention network, which takes the multimodal feature vectors as input and automatically identifies the crucial weights corresponding to each modality, as shown in Eq. (7):

$$\upsilon_f = \frac{\exp(\varphi(W * J_f + b))}{\sum_f \exp(\varphi(W * J_f + b))} \tag{7}$$

where $W$ and $b$ are the learnable weights, and $\varphi$ is the activation function.

Thus, in a self-attention network, multiple input modalities are allowed to interact with each other to find the input that should receive more attention, which indicates the importance of the different multimodal input features in the sequence. The joint attended multimodal features are computed as the weighted average over the whole feature sequence, as shown below:

$$M = \sum_f \upsilon_f * J_f \tag{8}$$

Fig. 2. Block diagram explaining the Visual Attention Module. Legend: $M$: feature map from CNN; $MP$: max pooling; $GAP$: global average pooling; $MLP$: multi-layer perceptron; $\oplus$: element-wise summation; $\otimes$: element-wise multiplication; ReLU: rectified linear unit; $F$: channel-refined feature map; Conv2D: convolution block; $A_c$: channel-attention maps; $A_s$: spatial-attention maps.

The obtained attended multimodal features $M$ are passed as input to the softmax classifier for sentiment classification as follows:

$$P(s) = \mathrm{Softmax}(W_s; M) \tag{9}$$

The whole network is trained on a training set by minimizing the cross-entropy loss with backpropagation as follows:

$$\mathrm{Loss} = -\sum y \log P(s) \tag{10}$$

where $y$ is the actual sentiment label of the training data.
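The joint attended multimodal learning steps of Eqs. (4) to (10) can be traced end to end with a small plain-Python sketch. Several points here are assumptions made for illustration and are not specified in the text: the visual and textual features are taken to be sequences of equal length and dimension, $W$ in Eq. (4) is treated as a vector that yields one scalar score per position, tanh stands in for the unspecified activation $\varphi$, and all numeric values are hypothetical.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def weighted_sum(weights, vectors):
    """Sum of vectors scaled by scalar weights."""
    return [sum(w * vec[i] for w, vec in zip(weights, vectors))
            for i in range(len(vectors[0]))]

def semantic_attention(visual, textual, w):
    """Eqs. (4)-(6): score v_f * t_f, normalise, pool the text features."""
    scores = [math.tanh(dot(w, [v * t for v, t in zip(v_f, t_f)]))
              for v_f, t_f in zip(visual, textual)]
    alpha = softmax(scores)
    return weighted_sum(alpha, textual), alpha

def self_attention(joint, w, b):
    """Eqs. (7)-(8), with tanh assumed as the activation phi."""
    upsilon = softmax([math.tanh(dot(w, j_f) + b) for j_f in joint])
    return weighted_sum(upsilon, joint), upsilon

def classify(features, class_weights):
    """Eq. (9): softmax over linear class scores."""
    return softmax([dot(row, features) for row in class_weights])

def cross_entropy(probabilities, label):
    """Eq. (10) for one sample: negative log-probability of the true class."""
    return -math.log(probabilities[label])

# Toy inputs: 3 positions, feature dimension 2 (all values hypothetical).
v = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
t = [[1.0, 0.0], [0.0, 1.0], [0.3, 0.3]]
s, alpha = semantic_attention(v, t, w=[1.0, 1.0])
J = [s + v_f for v_f in v]                       # J_f = (s, v_f), length 4
M, upsilon = self_attention(J, w=[0.5, 0.5, 0.5, 0.5], b=0.0)
P = classify(M, class_weights=[[1.0, 0.0, 1.0, 0.0],
                               [0.0, 1.0, 0.0, 1.0],
                               [0.5, 0.5, 0.5, 0.5]])
loss = cross_entropy(P, label=0)
```

Both attention stages follow the same pattern: compute one score per element, normalise the scores with a softmax, and take the weighted sum. Text positions whose features align with the visual features receive larger weights `alpha`, and the second stage reweights the joint features before classification.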
4 EXPERIMENTS
In this section, we conduct several experiments to confirm the efficacy of our DMLANet on popular real-world datasets and report the quantitative and qualitative results.
We collected four large-scale, real-world datasets from various social media platforms for conducting the multimodal sentiment classification. Table I shows the complete statistics of each dataset. The datasets are explained as follows:

A. MVSA (Multi-View Sentiment Analysis Dataset):
The MVSA dataset [25] consists of two separate datasets. MVSA-Single contains 5129 image-text pairs from Twitter, where each pair is labelled by a single annotator. MVSA-Multiple consists of 19600 image-text pairs, each labelled by three annotators. The actual sentiment is calculated by taking the majority vote out of the three sentiments (positive, negative, and neutral) for each modality separately. In both cases, the annotator’s judgments for the text and image sentiment labels are independent, so many tweets end up with inconsistent textual and image sentiment labels. We apply the following rules to deal with inconsistent sentiment labels between the modalities: tweets with one positive label and one negative label are removed; if a tweet has one positive (or negative) label and one neutral label, the final multimodal sentiment is positive (or negative). Finally, we get 4511 image-text pairs for MVSA-Single and 17024 image-text pairs for MVSA-Multiple.

B. Flickr:
We collect the image-text pairs from the Flickr website by using the 1200 adjective-noun pairs (ANPs), as described in [26]. The images were weakly labeled according to the sentiment of the ANP into the positive and negative sentiment categories only. We also collect the English descriptions associated with the images. Images with too short a text (<5 words) or too long a text (>100 words) were removed. Thus, we obtained a dataset of 276,571 weakly labelled image-text pairs.

C. Getty Images:
Getty Images is a supplier of videos, photos, and music with relatively formal text descriptions, which can be conveniently browsed by users. Similar to [27], we query Getty Images with 101 sentimental keywords from the Balanced Affective Word List Project to download the images with their corresponding text descriptions. The downloaded image-text pairs were weakly labeled as per the sentiment keywords into the positive and negative sentiment categories, giving us 453,289 image-text pairs.

Table I
OVERALL STATISTICS OF EACH DATASET
Datasets | Positive | Negative | Neutral | Total | Labelling
MVSA-Single | 2683 | 1358 | 470 | 4511 | Strong
MVSA-Multiple | 11318 | 1298 | 4408 | 17024 | Strong
Flickr | 129317 | 147254 | - | 276571 | Weak
Getty Images | 235732 | 217557 | - | 453289 | Weak
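The rules used to reconcile the text and image labels of the MVSA tweets can be written as a short function. The sketch below is illustrative: the function names, the string labels, and the sample annotations are hypothetical, and the case where both modalities agree is assumed to keep the shared label.

```python
def merge_labels(text_label, image_label):
    """Combine per-modality labels into one multimodal label.

    Returns None when the pair must be discarded.
    """
    if text_label == image_label:
        return text_label                # agreement (assumed to be kept as-is)
    pair = {text_label, image_label}
    if pair == {"positive", "negative"}:
        return None                      # conflicting polarity: remove
    # One polar label and one neutral label: keep the polar one.
    return (pair - {"neutral"}).pop()

def filter_pairs(samples):
    """Keep (sample_id, label) for every pair with a consistent label."""
    kept = []
    for sample_id, text_label, image_label in samples:
        label = merge_labels(text_label, image_label)
        if label is not None:
            kept.append((sample_id, label))
    return kept

# Hypothetical annotations: (id, text sentiment, image sentiment).
samples = [
    (1, "positive", "positive"),
    (2, "positive", "negative"),
    (3, "neutral", "negative"),
    (4, "neutral", "neutral"),
]
kept = filter_pairs(samples)
```

In this toy input, sample 2 is dropped because its two labels have opposite polarity, and sample 3 takes the negative label over the neutral one.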
The proposed DMLANet is implemented in Python using the Keras deep learning framework. The experiments were performed on a 64-bit Windows 10 machine with 128 GB RAM and NVIDIA Titan-RTX GPUs. We set the learning rate to 0.001 and the batch size to 256, and use the Adam optimizer. Dropout is used to avoid overfitting. We performed experiments via a five-fold cross-validation strategy. The datasets are split in an 80:10:10 ratio into training, validation, and testing sets, respectively. The final accuracy is calculated by averaging the results across the test folds. The model achieving the highest validation accuracy is selected for the testing phase.
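A minimal sketch of the 80:10:10 split is given below. The text does not say how the split is combined with the five folds, so the sketch simply repeats a shuffled split with five different seeds as a stand-in; that choice, the function name, and the dictionary keys are assumptions, while the hyperparameter values are the ones reported above.

```python
import random

# Hyperparameters reported in the text; the dictionary keys are hypothetical.
CONFIG = {"learning_rate": 0.001, "batch_size": 256, "optimizer": "adam"}

def split_indices(n_samples, seed, ratios=(0.8, 0.1, 0.1)):
    """Shuffle sample indices and cut them into train/validation/test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = int(ratios[0] * n_samples)
    n_val = int(ratios[1] * n_samples)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]     # remainder goes to the test part
    return train, val, test

# Five repetitions with different seeds, standing in for the five folds.
splits = [split_indices(4511, seed) for seed in range(5)]
train, val, test = splits[0]
```

With the 4511 MVSA-Single pairs this yields 3608 training, 451 validation, and 452 test indices per repetition, and a reported score would be the mean of the five test results.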
In this section, we validate the proposed model on all four datasets, as shown in Fig. 3. We use the following evaluation metrics to validate our model: Precision, Recall, F1 score, and Accuracy. All the evaluation metrics range from 0 to 100%, where a higher value indicates better performance. For multi-class classification, the average F1 score and accuracy are 79.59% and 79.47%, respectively, for MVSA-Single, and 75.26% and 77.89%, respectively, for MVSA-Multiple. For binary classification, the average F1 score and accuracy are 89.19% and 89.30%, respectively, for Flickr, and 92.60% and 92.65%, respectively, for Getty Images.

Since the datasets contain imbalanced samples, we also used ROC (receiver operating characteristic) curves and PRC (precision-recall curves) to validate the performance of our model further. The ROC curves in Fig. 4 (a) show that our model achieves a high TPR (true positive rate) on all the datasets. The AUC (area under the ROC curve) helps to compare the different ROC curves. The highest AUC, 94.46%, is achieved on Getty Images; the model nevertheless distinguishes between the classes in both the binary and the multi-class sentiment classification settings. Compared with ROC curves, PRC are more suitable for imbalanced datasets. Hence, we plotted the PRC in Fig. 4 (b) between the precision and recall values to compare the performance of our model across the datasets. As evident from the curves, the joint attended learning approach in DMLANet is effective at learning the multimodal features for sentiment classification.

To show the evolution of the model’s performance, we plot the training and validation curves for the MVSA-Multiple dataset. Fig. 5 (a) shows that the training and validation loss decrease with increasing data, and Fig. 5 (b) shows that the training and validation accuracy increase with the data. It can be clearly seen that as more and more data is supplied to the model, it learns adequate features from the data and finally converges after approximately 50 epochs.
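For reference, the four scalar metrics can be computed from predicted and true labels as in the sketch below. The text reports "average" scores without naming the averaging scheme, so the unweighted (macro) average over classes used here is an assumption, and the label lists are hypothetical.

```python
def per_class_metrics(y_true, y_pred, label):
    """Precision, recall, and F1 for one class treated as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def evaluate(y_true, y_pred):
    """Macro-averaged precision, recall, F1 and overall accuracy, in percent."""
    labels = sorted(set(y_true))
    scores = [per_class_metrics(y_true, y_pred, label) for label in labels]
    macro = [100 * sum(col) / len(labels) for col in zip(*scores)]
    accuracy = 100 * sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return {"precision": macro[0], "recall": macro[1],
            "f1": macro[2], "accuracy": accuracy}

# Hypothetical labels for six samples.
y_true = ["pos", "pos", "neg", "neu", "neg", "pos"]
y_pred = ["pos", "neg", "neg", "neu", "neg", "pos"]
report = evaluate(y_true, y_pred)
```

Macro averaging gives each class the same weight regardless of its size, which matters on imbalanced data such as MVSA-Multiple, where the negative class is much smaller than the positive one.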
This section compares our work with state-of-the-art methods for the MVSA-Single, MVSA-Multiple, Flickr, and Getty Images datasets.
For the MVSA-Single and MVSA-Multiple datasets, the following baselines were used for comparison:

SentiBank and SentiStrength [26]: SentiBank extracts 1200 ANPs as mid-level features for image classification, and SentiStrength utilizes grammar and spelling style from the text. Both techniques are combined to handle multimodal sentiment classification.

MNN (Merged Neural Network) [28]: MNN utilizes a CNN to extract the multimodal features, which are fused by a residual model using early (Early-RMNN) and late fusion (Late-RMNN).

HSAN (Hierarchical Semantic Attentional Network) [15]: The text-based HAN extracts textual information from tweets, and semantic image features are extracted by a CNN-LSTM model. The important words are reflected by using the attention mechanism.

CoMN (Co-Memory Network) [29]: A stacked co-memory network is developed, which uses text features to capture image feature maps, while image information is utilized for identifying the textual keywords.

MultiSentiNet [30]: This model extracts the objects and scenes from the image, followed by an attention-based LSTM to fetch the textual features. Finally, the features are fused for the final sentiment classification.

FENet (Fusion-Extraction Network) [34]: It uses an Interactive Information Fusion (IIF) mechanism, which applies attention features across both modalities, and a Specific Information Extraction (SIE) layer, which is based on gated convolution, followed by late fusion for combining both modalities.

Fig. 3 Experimental results on the datasets (%). Panels: MVSA-Single, MVSA-Multiple, Flickr, Getty Images; metrics: Precision, Recall, F1 score, Accuracy; groups: Positive, Neutral, Negative, Average.

Fig. 4 (a) ROC curves (TPR against FPR; MVSA-Single AUC = 80.23, MVSA-Multiple AUC = 78.43, Flickr AUC = 90.36, Getty Images AUC = 94.46) (b) PRC curves (Precision against Recall) for the datasets.
Table II
COMPARISON RESULTS OF DIFFERENT METHODS FOR MVSA DATASETS (%)
Ref. | Methods | MVSA-Single F1 | MVSA-Single Accuracy | MVSA-Multiple F1 | MVSA-Multiple Accuracy
[26] | SentiBank & SentiStrength | - | 67.09 | - | 67.90
[15] | HSAN | - | 66.90 | - | 67.76
[29] | CoMN | | | |
…
Ours | DMLANet | 79.59 | 79.47 | 75.26 | 77.89

Table II displays the comparative results for the MVSA-Single and MVSA-Multiple datasets. Compared to the best performing baseline, FENet [34], our model achieves 5% more accuracy on MVSA-Single and 6% more accuracy on MVSA-Multiple. This shows that our multi-level attention contributes fine-grained features for sentiment classification. The F1 and accuracy scores of our DMLANet are thus higher than those of all the other baselines.

For Flickr and Getty Images, we compare our work with the following baselines:

AHRM (Attention-based Heterogeneous Relational Model) [31]: The visual features are captured by a dual attention mechanism, followed by a graph convolutional network which combines the social context information.

BDMLA (Bi-directional Multi-level Attention) [8]: The joint learning is done by two independent networks that learn the visual attention and semantic attention, followed by fusing the modalities with an MLP.

AMGN (Attention-based Modality Gated Networks) [32]: It utilizes a visual and semantic attention model to obtain word-related visual features, followed by a gated LSTM to extract more emotional features in the visual and textual modalities.

HDF (Hierarchical Deep Fusion) [20]: HDF captures the correlations between the image and textual content by using a hierarchical LSTM. Late fusion is employed using an MLP.

DMAF (Deep Multimodal Attentive Fusion) [9]: DMAF uses a deep CNN to extract the visual features and LSTM-based semantic attention for modelling the textual data.

Joint Cross-modal model [33]: The textual features are computed using an attention-based GRU, and the visual features are calculated using maximum mean discrepancy. Finally, an attention-based LSTM is used to compute the final sentiment polarity.

Table III
COMPARISON RESULTS OF DIFFERENT METHODS FOR FLICKR AND GETTY IMAGES (%)
Ref. | Methods | Flickr F1 | Flickr Accuracy | Getty Images F1 | Getty Images Accuracy
[31] | AHRM | 87.5 | 87.1 | 88.4 | 87.8
[8] | BDMLA | | | |
…
Ours | DMLANet | 89.19 | 89.30 | 92.60 | 92.65

The comparative results on Flickr and Getty Images are shown in Table III. The accuracy obtained on the Flickr dataset is 89.30% and on Getty Images is 92.65%, which is 4% higher than AMGN [9], which performs best amongst the baselines. We also see that the accuracy and F1 scores on Getty Images are higher than on Flickr. This may be because, compared to Flickr, the textual descriptions on Getty are more formal and relevant to the image content. Thus, we can say that our proposed model effectively exploits the correlation between the textual and image modalities for all four datasets.

Fig. 5 (a) Training and Validation Loss curves (b) Training and Validation accuracy curves on MVSA-Multiple Dataset. (Axes: Loss against Epochs; Accuracy against Epochs.)
In this section, we perform an ablation study to evaluate the contribution of each module. We conduct an ablation study on two datasets: MVSA-Multiple (Multiclass and Strongly labelled dataset) and Flickr (Binary class and Weakly labelled dataset). We retrain our model by ablating the following crucial components: Spatial attention (SA) + channel attention (CA), Semantic attention (SMAtt) and Self-Attention (SAtt). The results are shown in Table IV.
TABLE IV ABLATION STUDIES ON MVSA-MULTIPLE AND FLICKR DA-TASETS
Da-tasets Model F1 score (%) Accuracy (%)
MVSA-Multi-ple DMLANet w/o (SA + CA) 71.29 70.85 DMLANet w/o (SMAtt) 70.17 70.00 DMLANet w/o SAtt 73.98 73.54
DMLANet 75.26 77.89
Flickr DMLANet w/o (SA + CA) 85.54 85.77 DMLANet w/o (SMAtt) 82.44 81.90 DMLANet w/o SAtt 88.01 87.98
DMLANet 89.19 89.30 DMLANet w/o (SA + CA):
This ablated model does not use the spatial and channel attention block. The features obtained from the Inception V3 module are given directly to the semantic attention block. This lowers the F1 score and accuracy on both datasets, which shows that the channel-attended and region-attended visual features help in learning discriminative image features.

DMLANet w/o (SMAtt):
Here, we ablate the semantic attention block and directly concatenate the bi-attentive visual features with the high-level textual features obtained from the LSTM. In this case, we observe a significant drop in performance on both datasets: accuracy falls by roughly 7 to 8 percentage points on each (from 77.89% to 70.00% on MVSA-Multiple and from 89.30% to 81.90% on Flickr). These results indicate the importance of semantic attention, which captures how closely the words are linked to the contents of the image and thereby explores the correlation between the features of the two modalities.

DMLANet w/o SAtt:
In this ablated model, the self-attention module is not used. The joint multimodal features $J_f$ are fed directly into the dense layer for the final classification. The F1 score drops by a little over 1 percentage point on both datasets (from 75.26% to 73.98% on MVSA-Multiple and from 89.19% to 88.01% on Flickr). Similarly, the accuracy drops to 73.54% and 87.98%, respectively. These results show that it is necessary to focus only on the essential sentiment-rich multimodal features, as not all the features are important for the classifier.

Based on the results in Table IV, we conclude that the multi-level attention in the form of channel attention, spatial attention, semantic attention, and self-attention exploits the correlation between the visual and textual modalities by filtering out irrelevant and redundant information. This enhances sentiment analysis performance on multimodal data.

Fig. 6 Quantitative analysis of DMLANet for Positive image-text pairs (panels: Original Image-Text Pair, Attended Image, Attended Text)

In this part, we evaluate our proposed model by visualizing its sentiment classification results. We randomly select three positive samples (Fig. 6) and three negative samples (Fig. 7) from the MVSA datasets. We use gradient-based class activation maps [35] to visualize the visual attention weights, whereas the background color in the attended text reflects the semantic attention: the brighter the color, the higher the attended semantic score. Together, the visual and semantic attention tell “what” our model infers from the image-text pair.

As seen in Fig. 6, the visual attention is drawn to the appropriate image regions, that is, the more affective regions that contribute to the positive sentiment. The semantic attention focuses on words like “amazing”, “excited”, and “wonderful”, which convey the positive sentiment. Similarly, in Fig. 7, negative sentiments are expressed by focusing on crucial regions and on words like “starved”, “crap”, and “betrayed”. However, in many cases it is difficult to tell the exact sentiment of a text, since it may contain sarcastic statements in which positive words convey negative sentiments. For example, the second sample in Fig. 7 uses positive words like “ecstatic”, yet it conveys a negative sentiment. In such cases, combining the text with visual attention helps to classify the sample as negative.

CONCLUSION
This paper proposes a Deep Multi-Level Attentive network (DMLANet) to model the correlation between image and text modalities and to extract only the sentiment-rich multimodal features. The visual attention block applies channel and spatial attention to generate robust bi-attentive visual features. Moreover, the joint-attended multimodal learning aims to acquire high-quality representations from the text and image modalities by focusing on the words that are related to the image contents. Experiments on four real-world datasets show promising results, as validated by Precision, Recall, F1 score, accuracy, ROC, and PRC metrics. Hence, the proposed model achieves the best performance among the compared multimodal approaches.

The proposed model focuses on data samples having some fine-grained correlation between image-text pairs. However, such a correlation is not always present in real-world data. Hence, in the future we plan to develop a robust fusion method that works well on datasets that do not have a close cross-modal correlation.

Fig. 7 Quantitative analysis of DMLANet for Negative image-text pairs (panels: Original Image-Text Pair, Attended Image, Attended Text)

REFERENCES
[1] A. Yadav and D. K. Vishwakarma, "A deep learning architecture of RA-DLNet for visual sentiment analysis," Multimedia Systems, vol. 26, pp. 431-451, 2020.
[2] L. Wang, J. Niu and S. Yu, "SentiDiff: Combining textual information and sentiment diffusion patterns for Twitter sentiment analysis," IEEE Transactions on Knowledge and Data Engineering, vol. 32, no. 10, pp. 2026-2039, 2019.
[3] K. Denecke and Y. Deng, "Sentiment analysis in medical settings: New opportunities and challenges," Artificial Intelligence in Medicine, vol. 64, pp. 17-27, 2015.
[4] A. Sharma and U. Ghose, "Sentimental Analysis of Twitter Data with respect to General Elections in India," Procedia Computer Science, vol. 173, pp. 325-334, 2020.
[5] A. Yadav and D. K. Vishwakarma, "A unified framework of deep networks for genre classification using movie trailer," Applied Soft Computing, vol. 96, p. 106624, 2020.
[6] S. W. Chan and M. W. Chong, "Sentiment analysis in financial texts," Decision Support Systems, vol. 94, pp. 53-64, 2017.
[7] S. Poria, I. Chaturvedi, E. Cambria and A. Hussain, "Convolutional MKL Based Multimodal Emotion Recognition and Sentiment Analysis," in IEEE 16th International Conference on Data Mining, Spain, 2016.
[8] J. Xu, F. Huang, X. Zhang, S. Wang, C. Li, Z. Li and Y. He, "Visual-textual sentiment classification with bi-directional multi-level attention networks," Knowledge-Based Systems, vol. 178, pp. 61-73, 2019.
[9] F. Huang, X. Zhang, Z. Zhao, J. Xu and Z. Li, "Image–text sentiment analysis via deep multimodal attentive fusion," Knowledge-Based Systems, vol. 167, pp. 26-37, 2019.
[10] R. Ji, F. Chen, L. Cao and Y. Gao, "Cross-Modality Microblog Sentiment Prediction via Bi-Layer Multimodal Hypergraph Learning," IEEE Transactions on Multimedia, vol. 21, no. 4, pp. 1062-1075, 2018.
[11] C. Baecchi, T. Uricchio, M. Bertini and A. D. Bimbo, "A multimodal feature learning approach for sentiment analysis of social network multimedia," Multimedia Tools and Applications, vol. 75, no. 5, pp. 2507-2525, 2016.
[12] Q. Fang, C. Xu, J. Sang, M. S. Hossain and G. Muhammad, "Word-of-mouth understanding: Entity-centric multimodal aspect-opinion mining in social media," IEEE Transactions on Multimedia, vol. 17, no. 12, pp. 2281-2296, 2015.
[13] S. Dai and H. Man, "Integrating Visual and Textual Affective Descriptors for Sentiment Analysis of Social Media Posts," in IEEE Conference on Multimedia Information Processing and Retrieval, Florida, 2018.
[14] A. Yadav and D. K. Vishwakarma, "Sentiment analysis using deep learning architectures: a review," Artificial Intelligence Review, vol. 53, no. 6, pp. 4335-4385, 2019.
[15] N. Xu, "Analyzing multimodal public sentiment based on hierarchical semantic attentional network," in IEEE International Conference on Intelligence and Security Informatics (ISI), China, 2017.
[16] F. Chen, R. Ji, J. Su, D. Cao and Y. Gao, "Predicting Microblog Sentiments via Weakly Supervised Multi-Modal Deep Learning," IEEE Transactions on Multimedia, vol. 20, no. 4, pp. 997-1007, 2017.
[17] Z. Zhao, H. Zhu, Z. Xue, Z. Liu, J. Tian, M. Chua and M. Liu, "An image-text consistency driven multimodal sentiment analysis approach for social media," Information Processing and Management, vol. 56, no. 6, p. 102097, 2019.
[18] J. Yu, J. Jiang and R. Xia, "Entity-Sensitive Attention and Fusion Network for Entity-Level Multimodal Sentiment Classification," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 429-439, 2019.
[19] L. Meng, A.-H. Tan and D. Xu, "Semi-supervised heterogeneous fusion for multimedia data co-clustering," IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 9, pp. 2293-2306, 2013.
[20] J. Xu, F. Huang, X. Zhang, S. Wang, C. Li, Z. Li and Y. He, "Sentiment analysis of social images via hierarchical deep fusion of content and links," Applied Soft Computing, vol. 80, pp. 387-399, 2019.
[21] J. Wang, W. Wang, L. Wang, Z. Wang, D. D. Feng and T. Tan, "Learning visual relationship and context-aware attention for image captioning," Pattern Recognition, vol. 98, p. 107075, 2020.
[22] S. Woo, J. Park, J.-Y. Lee and I. S. Kweon, "CBAM: Convolutional Block Attention Module," Proceedings of the European conference on computer vision (ECCV), pp. 3-19, 2018.
[23] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens and Z. Wojna, "Rethinking the inception architecture for computer vision," Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2818-2826, 2016.
[24] H. Ma, W. Li, X. Zhang, S. Gao and S. Lu, "AttnSense: Multi-level Attention Mechanism For Multimodal Human Activity Recognition," Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, pp. 3109-3115, 2019.
[25] T. Niu, "Sentiment analysis on multi-view social data," International Conference on Multimedia Modeling, Springer, pp. 15-27, 2016.
[26] D. Borth, R. Ji, T. Chen, T. Breuel and S.-F. Chang, "Large-scale visual sentiment ontology and detectors using adjective noun pairs," Proceedings of the 21st ACM international conference on Multimedia, pp. 223-232, 2013.
[27] Q. You, J. Luo, H. Jin and J. Yang, "Cross-modality consistent regression for joint visual-textual sentiment analysis of social multimedia," Proceedings of the Ninth ACM international conference on Web search and data mining, pp. 13-22, 2016.
[28] N. Xu and W. Mao, "A residual merged neutral network for multimodal sentiment analysis," IEEE 2nd International Conference on Big Data Analysis (ICBDA), pp. 6-10, 2017.
[29] N. Xu, W. Mao and G. Chen, "A Co-Memory Network for Multimodal Sentiment Analysis," The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pp. 929-932, 2018.
[30] N. Xu and W. Mao, "MultiSentiNet: A Deep Semantic Network for Multimodal Sentiment Analysis," Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pp. 2399-2402, 2017.
[31] J. Xu, Z. Li, F. Huang, C. Li and P. S. Yu, "Social Image Sentiment Analysis by Exploiting Multimodal Content and Heterogeneous Relations," IEEE Transactions on Industrial Informatics, pp. 1-8, 2020.
[32] F. Huang, K. Wei, J. Weng and Z. Li, "Attention-Based Modality-Gated Networks for Image-Text Sentiment Analysis," ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), vol. 16, no. 3, pp. 1-9, 2020.
[33] K. Zhang, Y. Zhu, W. Zhang, W. Zhang and Y. Zhu, "Transfer Correlation Between Textual Content to Images for Sentiment Analysis," IEEE Access, vol. 8, pp. 35276-35289, 2020.
[34] T. Jiang, J. Wang, Z. Liu and Y. Ling, "Fusion-Extraction Network for Multimodal Sentiment Analysis," Pacific-Asia Conference on Knowledge Discovery and Data Mining, Springer, pp. 785-797, 2020.
[35] R. R. Selvaraju, A. Das, R. Vedantam, M. Cogswell, D. Parikh and D. Batra, "Grad-CAM: Why did you say that?," in arXiv preprint arXiv:1611.07450, 2016.

Ashima Yadav received the B.Sc. (Hons) degree in Computer Science from the University of Delhi, New Delhi, India, in 2013, and the M.C.A. degree from Guru Gobind Singh Indraprastha University, New Delhi, India, in 2016. She is currently working towards the Ph.D. degree at the Department of Information Technology, Delhi Technological University, New Delhi, India. Her current research interests include deep learning, natural language processing, machine learning, emotion recognition, and sentiment analysis.
Dinesh Kumar Vishwakarma (M’16, SM’19) received the B.Tech. degree from Dr. Ram Manohar Lohia Avadh University, Faizabad, India, in 2002, the M.Tech. degree from the Motilal Nehru National Institute of Technology, Allahabad, India, in 2005, and the Ph.D. degree in the field of Computer Vision and Machine Learning from Delhi Technological University (Formerly Delhi College of Engineering), New Delhi, India, in 2016. He is currently an Associate Professor with the Department of Information Technology, Delhi Technological