A Novel Deep Learning Method for Textual Sentiment Analysis

Abstract

Sentiment analysis is known as one of the most crucial tasks in the field of natural language processing and Convolutional Neural Network (CNN) is one of those prominent models that is commonly used for this aim. Although convolutional neural networks have obtained remarkable results in recent years, they are still confronted with some limitations. Firstly, they consider that all words in a sentence have equal contributions in the sentence meaning representation and are not able to extract informative words. Secondly, they require a large number of training data to obtain considerable results while they have many parameters that must be accurately adjusted. To this end, a convolutional neural network integrated with a hierarchical attention layer is proposed which is able to extract informative words and assign them higher weight. Moreover, the effect of transfer learning that transfers knowledge learned in the source domain to the target domain with the aim of improving the performance is also explored. Based on the empirical results, the proposed model not only has higher classification accuracy and can extract informative words but also applying incremental transfer learning can significantly enhance the classification performance.


Hossein Sadr *
Department of Computer Engineering, Rasht Branch, Islamic Azad University, Rasht, Iran

Mozhdeh Nazari Solimandarabi
Department of Computer Engineering, Rasht Branch, Islamic Azad University, Rasht, Iran

Mir Mohsen Pedram
Department of Electrical and Computer Engineering, Faculty of Engineering, Kharazmi University, Tehran, Iran

Mohammad Teshnehlab
Industrial Control Center of Excellence, Faculty of Electrical and Computer Engineering, K. N. Toosi University, Tehran, Iran


Keywords: Sentiment analysis, Deep learning, Convolutional neural network, Attention mechanism, Transfer learning

INTRODUCTION

During the past few years, a large amount of text containing people's opinions, sentiments, attitudes, and emotions has been rapidly produced due to the explosive growth of social media. Because collecting and analyzing such a large amount of unstructured data manually is not feasible, efficient methods are needed to collect and process it automatically. This automatic process of text analysis and computational linguistics, which aims to extract the subjective information contained in a text, is known as sentiment analysis [1, 2]. Sentiment analysis is one of the most active research areas in natural language processing. It tries to classify a piece of opinionated text based on its polarity and to determine whether an opinion expressed about a specific topic, event, or product is positive or negative [3, 4].

In recent years, deep neural networks have attracted many researchers in this field due to their remarkable results. However, although deep learning models have been quite effective in sentiment analysis, they still suffer from over-abstraction problems [5, 6]. That is, these models can only determine the polarity of a document and cannot provide a deeper understanding of the text, such as identifying the main words that led to the polarity classification [7, 8]. It must also be taken into consideration that not all parts of a text are equally important and that some words in a sentence have more impact on the overall meaning of the text. Moreover, deep learning models, unlike the human brain, cannot pay more attention to the salient parts of a text, which reduces their effectiveness [9]. Recently, a new direction of deep learning models has emerged that tries to simulate the attention mechanism found in the human brain. The attention mechanism focuses on the more important parts of a text (as the human brain does while reading) and neglects the less important parts [2, 10].

On the other hand, convolutional neural networks have been widely used for sentiment analysis and have achieved significant progress due to their ability to extract robust and abstract features from the input [11-13]. In this regard, we decided to integrate the attention mechanism with convolutional neural networks to present a powerful model for sentiment analysis of texts. The intuition behind our model is that not all words in a sentence are equally important, and that an attention layer can identify the most informative words and phrases of a sentence by considering the context, even when phrase-level and word-level sentiment labels are not available [9, 14, 15].

Furthermore, convolutional neural networks require a large amount of training data to train the model accurately, and they have many parameters that must be tuned precisely. To this end, transfer learning, as a sub-domain of machine learning, has attracted considerable attention in recent years. The focus of transfer learning is on learning knowledge from the source domain and then applying it to the target domain [16, 17]. The impact of transfer learning is visible when the training set is not large enough to train the model efficiently. In that case, the model can be trained on a large dataset and then transferred to the target domain, which can lead to performance enhancement. In this regard, we also investigate the effect of transfer learning on our proposed model and explore the sensitivity of the parameters in the source domain. The main contributions of this paper are as follows:

- It is the first time that a convolutional neural network and a hierarchical attention layer are integrated to mimic the human brain, aiming to focus on salient words and phrases in the text for the task of sentiment analysis. The proposed model is able to provide insight into which words carry more valuable information and contribute to the classification decision considering the context.

- The effect of transfer learning with and without training the model on the target domain is explored. Based on the obtained results, applying transfer learning remarkably enhanced the overall performance.

- An extensive set of experiments was conducted to demonstrate the sensitivity of the parameters and to obtain the optimal values for training the model.

The rest of the paper is organized as follows. Related studies are briefly described in Section 2. Details of the proposed model are explained in Section 3. Datasets, model configuration, training, and experimental results are described in detail in Section 4. Conclusions and directions for future research are given in Section 5.

RELATED WORK

In recent years, deep learning has been the center of attention and has arguably revolutionized various fields, especially natural language processing. Nowadays, the effect of deep learning on tasks such as text classification, document summarization, machine translation, and language modeling is evident, and sentiment analysis is one of the important areas of natural language processing that has improved considerably through deep neural networks. Deep learning includes various networks, such as the Convolutional Neural Network (CNN) [18], Recursive and Recurrent Neural Networks (RNN) [19, 20], and the Deep Belief Network (DBN) [21], that have been extensively used in the field of sentiment analysis.

In this regard, Kim et al. [18] conducted a series of experiments based on a one-layer convolutional neural network. They trained their models on pre-trained vectors derived from the Word2Vec embedding model. They also employed a multi-channel representation and various filter sizes and achieved comparable results. In contrast to modeling sentences at the word level, Zhang et al. [11] presented a character-level CNN for text classification that showed significant enhancement in classification accuracy. Moreover, Kalchbrenner et al. [22] proposed a dynamic CNN that utilized dynamic k-max pooling. Their model was able to handle input sentences of variable length and could efficiently capture short- and long-term dependencies. Yin and Schutze presented a multichannel variable-size CNN that employed combinations of various word embedding techniques as input [23].

Subsequently, Tai et al. [24] employed a Long Short-Term Memory (LSTM) network integrated with some complex units for sentiment analysis. They also conducted experiments on 2-layer and bidirectional LSTMs and achieved significant results. Following a similar line of research, Kuta et al. [25] proposed a tree-structured gated recurrent neural network, which was inspired by the tree-structured LSTM and adapts the Gated Recurrent Unit (GRU) to a recursive model. Besides these networks, a semi-supervised model known as the recursive neural network, which uses continuous word vectors as input and a hierarchical structure, has also been employed for sentiment analysis. In this regard, Socher et al. [19] introduced a model, known as MV-RNN, that employed both a matrix and a vector to represent words and phrases in the tree structure. The Recursive Neural Tensor Network (RNTN) is another network in this field, proposed by Socher et al. [26], in which a tensor-based compositional matrix was used instead of a matrix representation for all nodes in the tree structure.

Although deep neural networks have achieved significant results in the field of sentiment analysis, they still face some limitations. One of their general pitfalls is that they treat all words in a sentence equally and are not able to focus on the salient parts of the text [7]. To fill this gap, the attention mechanism has recently been adopted in many natural language processing tasks, especially sentiment analysis, due to its strength in providing an effective interpretation of the text. In this regard, Yang et al. [7] modified the RNN by adding weights that play the role of attention for text classification. Wang et al. [27] also proposed an attention-based LSTM network that could focus on various parts of a sentence. It must be taken into consideration that, despite the promising results of applying the attention mechanism to deep neural networks, only a few such studies have been conducted in the field of sentiment analysis.

Another challenge that deep neural networks commonly face is the lack of training data. Deep neural networks require a large amount of training data to train the model accurately, and their performance improves as the amount of data increases [28]. The lack of available labeled training data has led to the emergence of a concept known as transfer learning. Transfer learning is used when the training set is not large enough to train the model efficiently. The model is therefore trained on a large dataset, known as the source domain, and is then transferred to the target domain, which can significantly enhance performance. Although transfer learning has been extensively used in the field of image processing, its application in natural language processing, especially sentiment analysis, is still limited. In this regard, Krizhevsky and Lee [29] demonstrated the efficiency of transferring low-level neural layers across different tasks. In a similar study [30], the impact of transferring the high-level layers of a deep neural network from a source dataset to a smaller target dataset was investigated. However, it is worth mentioning that the effect of transfer learning on the task of sentiment analysis has rarely been explored [9, 31].

Considering the mentioned challenges, we propose a novel CNN integrated with an attention layer for sentiment analysis. Unlike previous studies, the proposed model applies the attention mechanism after the convolutional layer to extract the informative words in a sentence by assigning a higher weight to them, which creates a new representation of the word vectors. Moreover, in order to improve the overall performance of the proposed model, we employ transfer learning: the proposed model is first trained on a large dataset and is then transferred to the target dataset.

METHODOLOGY

The proposed model consists of two main processes. The first process is a slight variant of the classical Convolutional Neural Network (CNN) presented by Kim et al. [18], which employs a hierarchical attention mechanism on top of the convolutional neural network to emphasize the words that have a significant effect on the meaning of a sentence. The second process focuses on transfer learning, which stores the knowledge learned from a source domain and applies it to different but related target domains.

Process 1:

Convolutional neural network integrated with a hierarchical attention mechanism

The learning process of the convolutional neural network integrated with a hierarchical attention mechanism is presented in Fig. 1 and contains four layers. First, word embedding is performed: the word vectors of the input sentence are extracted and joined to form the initial input matrix for the CNN. Second, the CNN model is trained. Once training is completed, the feature maps extracted with the same filter size are merged and fed to the attention layer as a new matrix in the third layer. Next, the informative words are extracted by assigning a higher weight to them using the attention mechanism, and their representations are aggregated with the features previously extracted by the convolutional neural network to form new sentence vectors. Finally, the new vectors are fed into a fully connected network and classification is performed. A more detailed mathematical description of each layer is provided in the following.

Fig. 1. Learning process in the convolutional neural network integrated with a hierarchical attention mechanism.

[Fig. 1 blocks: Input Data; Word Embedding; Convolutional Layer; Feature Representation Layer; Hierarchical Attention Layer; Pooling and Fully Connected Layer; Classification Results and Evaluation]

The convolutional neural network requires a sentence matrix as input, where each row represents a word vector. If the dimensionality of the word vectors is $d$ and the length of a given sentence is $s$, the dimensionality of the sentence matrix is $s \times d$, where zero padding is added before the first word and after the last word of the sentence. Zero padding ensures that each word is included in the receptive field the same number of times during the convolution, regardless of its position in the sentence. As a result, the sentence matrix is denoted by $A \in \mathbb{R}^{s \times d}$. In this paper, various word embedding techniques, including Word2Vec [32], GloVe [33], and FastText [34], are employed to form the input matrix.
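As an illustration of the input layer described above, the following is a minimal sketch of how a zero-padded sentence matrix could be assembled. The function name `sentence_matrix`, the toy embedding table, the choice of mapping unknown words to zero vectors, and the padding width are all assumptions made for the example and are not taken from the paper.

```python
import numpy as np

def sentence_matrix(tokens, embeddings, d, pad):
    """Stack word vectors row by row and add `pad` zero rows at both ends."""
    # Assumption: words missing from the embedding table map to a zero vector.
    rows = [embeddings.get(tok, np.zeros(d)) for tok in tokens]
    A = np.stack(rows)                                   # shape (len(tokens), d)
    return np.pad(A, ((pad, pad), (0, 0)), mode="constant")

rng = np.random.default_rng(0)
d = 4                                                    # toy word-vector dimension
tokens = ["the", "movie", "was", "great"]
toy_embeddings = {w: rng.normal(size=d) for w in tokens} # stand-in for Word2Vec/GloVe/FastText
A = sentence_matrix(tokens, toy_embeddings, d, pad=2)    # pad = h - 1 for region size h = 3
print(A.shape)
```

With a region size of $h$, padding $h - 1$ zero rows on each side is one way to make every word fall inside the same number of filter windows.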

To produce new features, the convolution operation must be applied to the sentence matrix. Because the sequential structure of a sentence has an important effect on its meaning, it is sensible to choose a filter width equal to the dimensionality of the word vectors ($d$). Consequently, only the height of the filters ($h$), known as the region size, can be varied. Given a sentence matrix $A \in \mathbb{R}^{s \times d}$, a convolution filter $H \in \mathbb{R}^{h \times d}$ is applied to each submatrix $A[i:j]$ of $A$ (rows $i$ to $j$) to produce a new feature. As the convolution operation is applied repeatedly over $A$, the output sequence $O \in \mathbb{R}^{s-h+1}$ is obtained (Eq. 1).

$$O_i = H \cdot A[i : i+h-1] \qquad (1)$$

Here $i = 1, \ldots, s-h+1$ and $\cdot$ is the dot product between the convolution filter and the submatrix. A bias term $b \in \mathbb{R}$ and an activation function $f$ are then applied to each $O_i$. Finally, the feature map $C \in \mathbb{R}^{s-h+1}$ is generated (Eq. 2).

$$C_i = f(O_i + b) \qquad (2)$$

Because not all words contribute equally to the meaning of a sentence, a mechanism is needed to emphasize the words that have more impact on that meaning. For this aim, we apply an attention mechanism to the feature maps extracted by the previous layer. Feature maps that are extracted with the same filter size are aggregated to form a new matrix. Suppose that $M$ different region sizes are considered in the convolution layer and that $m$ different filters are employed for each of them. After applying the filters $H_{ij} \in \mathbb{R}^{h_i \times d}$ to the sentence matrix $A$, where $i = 1, 2, \ldots, M$ and $j = 1, 2, \ldots, m$, $M \times m$ feature maps are obtained. By concatenating the feature maps extracted with the same filter size, a new sentence matrix $X_i \in \mathbb{R}^{(n-h_i+1) \times m}$ is obtained (Eq. 3), where $n$ is the number of words and each element of the matrix represents a feature extracted from the input using filters of the same size.

$$X_i = \begin{bmatrix} \bar{x}_{1,1} & \cdots & \bar{x}_{1,m} \\ \vdots & \ddots & \vdots \\ \bar{x}_{n-h_i+1,1} & \cdots & \bar{x}_{n-h_i+1,m} \end{bmatrix} \qquad (3)$$

The objective of the attention mechanism is to assign a specific weight to each row in order to extract the informative parts of the sentence. First, the new matrix $X_i$ is fed through a single-layer perceptron with weights $W \in \mathbb{R}^{m \times d}$, and $U_i \in \mathbb{R}^{(n-h_i+1) \times d}$ is obtained as a hidden representation of $X_i$ (Eq. 4).

$$U_i = \tanh(X_i W + b) \qquad (4)$$

Next, the importance of each word is measured as the similarity of $U_i$ with a content vector $u \in \mathbb{R}^{d \times 1}$, and the normalized importance weights $a_i \in \mathbb{R}^{(n-h_i+1) \times 1}$ are obtained using Softmax (Eq. 5).

$$a_i = \mathrm{softmax}(U_i u) \qquad (5)$$

The content vector $u$ tries to identify the informative words. Notably, $u$ is initialized to zero, so that all rows of $X_i$ initially receive the same weight, and it is learned during the training process. After that, $\bar{X}_i$ (a new representation of $X_i$) is computed by multiplying each element of $a_i$ by its corresponding row in $X_i$ (Eq. 6).

$$\bar{X}_i = a_i \circ X_i \qquad (6)$$

The whole process of the attention layer is presented schematically in Fig. 2. After the feature maps extracted with the same filter size are merged, the matrix $X_i$ is created. Then, by applying a single-layer perceptron, a hidden representation of $X_i$, denoted $U_i$, is created. Next, the normalized importance weight $a_i$, indicating the importance of each word, is computed as the similarity between $U_i$ and the content vector $u$, which is a parameter learned during training. Finally, $\bar{X}_i$, the new representation of $X_i$, is obtained by multiplying each element of $a_i$ by its corresponding row in $X_i$. Overall, applying the attention mechanism extracts the informative words and assigns more weight to them.

Fig. 2. The overall process of the attention layer.
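To make Eq. 1 to Eq. 6 concrete, the following is a minimal NumPy sketch of one region size: several filters produce feature maps, the maps are stacked into a matrix, and the attention weights rescale its rows. It is an illustration under assumptions rather than the authors' implementation: the helper names, the random toy inputs, and the use of ReLU as the activation $f$ are choices made for the example, and no training is performed.

```python
import numpy as np

def conv_feature_map(A, H, b=0.0):
    """Eq. 1-2: slide filter H (h x d) over A (s x d); ReLU assumed as f."""
    s, h = A.shape[0], H.shape[0]
    O = np.array([np.sum(H * A[i:i + h]) for i in range(s - h + 1)])
    return np.maximum(O + b, 0.0)

def attention(X, W, b, u):
    """Eq. 4-6: rescale each row of X by its softmax-normalized importance."""
    U = np.tanh(X @ W + b)                 # hidden representation, (rows, d)
    scores = U @ u                         # similarity with the content vector
    scores = scores - scores.max()         # shift for numerical stability
    a = np.exp(scores) / np.exp(scores).sum()
    return a[:, None] * X, a

rng = np.random.default_rng(0)
s, d, h, m = 7, 5, 3, 4                    # toy sizes: words, dims, region size, filters
A = rng.normal(size=(s, d))                # stand-in sentence matrix
filters = rng.normal(size=(m, h, d))
X = np.stack([conv_feature_map(A, H) for H in filters], axis=1)  # (s-h+1, m)

W = rng.normal(size=(m, d))
b = np.zeros(d)
u = np.zeros(d)                            # zero initialization, as described in the text
X_bar, a = attention(X, W, b, u)
print(a)                                   # uniform weights while u is still zero
```

With $u$ at its zero initialization every row receives the same weight, which matches the behavior described for the start of training. The weights become non-uniform only once $u$ is updated.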

Because the feature maps generated by different filter sizes have different lengths, a pooling function is required to produce fixed-size vectors. Various strategies such as average pooling, minimum pooling, and maximum pooling can be used for this aim. The idea behind them is to capture the most important feature from each feature map and to reduce dimensionality.

The features generated by the pooling layer for each filter are concatenated into a feature vector $o$. The feature vector is passed to a fully connected Softmax layer to produce the final classification. In other words, Softmax determines the probability distribution over all sentiment categories and is calculated as follows (Eq. 7).

$$P_i = \frac{\exp(o_i)}{\sum_{j=1}^{c} \exp(o_j)} \qquad (7)$$

To measure the difference between the real sentiment distribution $\hat{P}_i(C)$ and the distribution produced by the model $P_i(C)$, cross-entropy is employed as the loss function (Eq. 8).

$$Loss = -\sum_{s \in T} \sum_{i=1}^{V} \hat{P}_i(C) \log(P_i(C)) \qquad (8)$$

Here $T$ is the training set and $V$ is the number of sentiment categories. Stochastic Gradient Descent (SGD) is used for end-to-end training of the model.

Process 2:

Transfer learning

Because increasing the amount of training data has a significant effect on the performance of deep neural networks [35], we employ transfer learning with the aim of combining two different domains to improve classification accuracy. The transfer learning process used in this paper for training the proposed model is depicted in Fig. 3 and Fig. 4. First, the convolutional neural network integrated with a hierarchical attention mechanism is trained on the source domain following the process flow shown in Fig. 1, and then the trained model is transferred to the target domain. The trained model is transferred in two different ways. In the first, the model does not learn from the target domain, which means that the trained model is only tested on the target domain. In the second, the model is also trained on the target domain to incrementally learn and update its knowledge. In this case the model is trained on both the source and target domains and can use their combination to increase the classification performance. Finally, this newly trained model is evaluated on the data in the target domain.

EXPERIMENTS

Dataset

In order to comprehensively investigate the effectiveness of the proposed model, standard sentiment analysis datasets were used in our experiments. Because the focus of this paper is on exploring the impact of transfer learning, different datasets are used as source and target domains. Transfer learning is used when the target domain is not large enough to train the model accurately. In this regard, we use a large dataset as the source domain and then train the proposed model on a smaller dataset, the target domain. A brief description of the source and target domains used in this paper is provided in the following.

- Source domain: Amazon Review. This dataset contains reviews of Amazon products that were collected by Zhang et al. [11]. It has 2-class (AMZ-2) and 5-class (AMZ-5) versions.

- Target domain: Stanford Sentiment Treebank. This dataset is commonly used for sentiment classification and also has 2-class (SST-2) and 5-class (SST-5) versions. It is the extended version of the MR dataset [36] and contains train/dev/test sets and fine-grained labels.

The Amazon review dataset was chosen as the source domain for two reasons: its large size makes it a suitable source, and the degree of semantic similarity between the two datasets is low. According to the experiments conducted by Zhang et al. [28], the less semantically similar two datasets are, the greater the performance gain from applying transfer learning. The Stanford Sentiment Treebank was chosen as the target domain because it is used in the majority of research on sentiment analysis, which makes it possible to compare the proposed model with a wide range of existing models. The summary statistics of these datasets are presented in Table 1.

Table 1. Summary statistics of the used datasets

| Domain | Dataset | C | L | S | V |
|---|---|---|---|---|---|
| D_S | AMZ-5 | 5 | 84 | 3650000 | 1057296 |
| D_S | AMZ-2 | 2 | 82 | 3000000 | 1112820 |
| D_T | SST-5 | 5 | 18 | 11855 | 17836 |
| D_T | SST-2 | 2 | 19 | 9613 | 16185 |

*D_S: Source domain, D_T: Target domain, C: Number of classes, L: Average sentence length, S: Number of sentences, V: Vocabulary size

Model Configuration

Generally, one of the downsides of convolutional neural networks is their large number of free hyper-parameters, which require practitioners to determine the exact model architecture. Because the hyper-parameter values have a significant impact on the performance of deep neural networks, we optimized the hyper-parameters of the proposed model on the source domain and then applied the optimal values to the target domain. In this regard, we first used the CNN configuration proposed by Zhang et al. [37] and carried out extensive sets of experiments to obtain the optimal values for the proposed model. The baseline configuration is presented in Table 2. Notably, 10-fold cross-validation, in which 10% of the training data was randomly selected as a test set, was performed. Each experiment was repeated 5 times and the average results are reported.

Table 2. Baseline configuration

| Hyper-parameter | Value |
|---|---|
| Word embedding | Random |
| Filter region size | 3, 4, 5 |
| Number of filters | 512 |
| Dropout rate | 0.5 |
| Activation function | ReLU |

Footnote links: https://goo.gl/bm0IkT http://nlp.stanford.edu/sentiment/

[Fig. 3 blocks: Source Data; Process 1: Learning Convolutional Neural Network Integrated with Hierarchical Attention Layer; Result: Trained Proposed CNN Model; Target Data; Evaluation]

Fig. 3. Transfer learning without training the model on the target domain.

[Fig. 4 blocks: Source Data; Process 1: Learning Convolutional Neural Network Integrated with Hierarchical Attention Layer; Result: Trained Proposed CNN Model; Target Data; Process 1: Learning Convolutional Neural Network Integrated with Hierarchical Attention Layer; Evaluation]

Fig. 4. Transfer learning with training the model on the target domain.
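The two transfer settings of Fig. 3 and Fig. 4 can be illustrated with a deliberately small stand-in model. The sketch below replaces the proposed CNN with a logistic regression on synthetic two-dimensional data, so only the control flow is meaningful: train on a large source set, then either test directly on the target set or continue training on it first. All names, sizes, and learning settings are assumptions made for the example.

```python
import numpy as np

def train(w, X, y, epochs=200, lr=0.5):
    """Full-batch gradient descent on the logistic loss, starting from w."""
    w = w.copy()
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        w -= lr * X.T @ (p - y) / len(y)
    return w

def accuracy(w, X, y):
    return float(np.mean(((X @ w) > 0) == (y == 1)))

def make_domain(rng, n, direction):
    """Synthetic binary task whose labels depend on a domain-specific direction."""
    X = rng.normal(size=(n, 2))
    y = (X @ direction > 0).astype(float)
    return X, y

rng = np.random.default_rng(0)
Xs, ys = make_domain(rng, 2000, np.array([1.0, 0.2]))   # large source domain
Xt, yt = make_domain(rng, 60, np.array([1.0, 1.0]))     # small, related target domain
Xt_train, yt_train, Xt_test, yt_test = Xt[:40], yt[:40], Xt[40:], yt[40:]

w_source = train(np.zeros(2), Xs, ys)

# Setting of Fig. 3: the source-trained model is only tested on the target domain.
acc_direct = accuracy(w_source, Xt_test, yt_test)

# Setting of Fig. 4: training continues on the target domain before evaluation.
w_incremental = train(w_source, Xt_train, yt_train, epochs=50)
acc_incremental = accuracy(w_incremental, Xt_test, yt_test)
print(acc_direct, acc_incremental)
```

The only difference between the two settings is whether the source-trained weights are used as they are or as the starting point for further training on target data.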

Word embedding effect

One of the interesting properties of sentence classification models is their ability to use distributed representations of words as input. They can also use various pre-trained word vectors to form the initial input. In this section, the sensitivity of the proposed model to different input representations is explored. We used 5 input variations in our experiments. First, we used randomly initialized word vectors as input (Random). Then, the model was trained using pre-trained word vectors obtained with the Word2Vec algorithm (Word2Vec-Static). The third variation uses pre-trained Word2Vec vectors that are also updated during the training process (Word2Vec-nonStatic). Next, the model was trained using word vectors obtained from a combination of Word2Vec and randomly initialized vectors (2-channel). Finally, we used a combination of pre-trained word vectors from Word2Vec, GloVe, and FastText as input, which were also updated during the training of the model (4-channel). Notably, when the word vectors were trained, their dimension was set to 200 and the window size was set to 3. Word vectors were updated with learning rates of {0.1, 0.05, 0.01}. It must be noted that the skip-gram structure was utilized in the Word2Vec [32] and FastText [34] models, while GloVe [33] used a unigram structure.

Table 3. The effect of the word embedding on the performance of the proposed model

| Word embedding | AMZ-2 accuracy (%) | AMZ-5 accuracy (%) |
|---|---|---|
| Random | 92.44 | 57.31 |
| Word2Vec-Static | 92.85 | 57.9 |
| Word2Vec-nonStatic | 93.12 | 58.7 |
| 2-channel | 93.6 | 58.6 |
| 4-channel | | |

Based on the obtained results (Table 3), Random word embedding has the lowest classification accuracy among all variations. The better performance of the other variations can be attributed to the use of pre-trained vectors, which can alleviate the semantic sparsity problem to some degree. In other words, pre-trained vector representations have a great effect on the performance of the proposed model. Moreover, since Word2Vec-Static embedding has the lowest classification accuracy after Random embedding, it can be stated that updating the word vectors during the training process can yield higher performance regardless of whether the word vectors were previously trained. Finally, 4-channel embedding has the highest classification accuracy. In this regard, we used 4-channel word embedding to train the proposed model on the target domain.

Filter region size effect

In order to investigate the effect of the filter region size, various region sizes were explored while the other parameters were kept constant. Because previous studies demonstrated the superiority of multiple region sizes over a single region size, we only used multiple region sizes in our experiments. As can be seen in Table 4, the region size has a great impact on the performance of the model, and the best result differs from the baseline: the greatest accuracy is obtained when the region sizes are set to (4, 5, 6). In this regard, region sizes (4, 5, 6) were used in the target domain.

Number of filter effect

Again, in this set of experiments, the other configurations were held constant and we only changed the number of filters for each region size. Based on the obtained results (Table 5), the number of filters also has a considerable impact on the performance of the proposed model, and the proposed model obtained the highest accuracy when the number of filters was set to 300. Therefore, we used 300 filters to train the model on the target domain.

Table 4. The effect of the region size on the performance of the proposed model

| Region size | AMZ-2 accuracy (%) | AMZ-5 accuracy (%) |
|---|---|---|
| (3,4,5) | 92.44 | 57.31 |
| (4,5,6) | | |
| (6,7,8) | 91.8 | 56.85 |
| (8,9,10) | 90.78 | 55.85 |
| (9,10,11) | 90.45 | 55.32 |
| (14,15,16) | 90.30 | 55.24 |
| (3,4,5,6) | 91.10 | 56.07 |
| (6,7,8,9) | 90.17 | 55.14 |
| (7,7,7,7) | 92.25 | 57.03 |
| (7,7,7) | 91.43 | 56.13 |

Table 5. The effect of the number of filters on the performance of the proposed model

| Number of filters | AMZ-2 accuracy (%) | AMZ-5 accuracy (%) |
|---|---|---|
| 125 | 91.5 | 56.32 |
| 256 | 91.78 | 56.54 |
| 300 | 92.83 | 57.74 |
| 512 | 92.44 | 57.31 |
| 450 | 90.25 | 55.34 |
| 640 | 90.85 | 55.95 |

Regularization effect

Dropout is a technique that is generally used for regularization and to avoid overfitting. With dropout, a number of neurons are randomly selected to be ignored during training, which has a great impact on the performance of the model. In this set of experiments, different dropout rates in the range of 0.1 to 0.9 were used to find the optimal rate. Based on the obtained results (Table 6), the highest accuracy was obtained when the dropout rate was around 0.6, and therefore this dropout rate was used to train the model on the target domain.

Table 6. The effect of the regularization on the performance of the proposed model

| Dropout rate | AMZ-2 accuracy (%) | AMZ-5 accuracy (%) |
|---|---|---|
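As a reminder of what the dropout rate controls, the following is a minimal sketch of dropout applied to a vector of activations. It uses the common "inverted" formulation, in which the surviving activations are rescaled during training. Whether the authors used this exact variant is not stated, so it is an assumption, as are the function name and the toy input.

```python
import numpy as np

def dropout(x, rate, rng, training=True):
    """Zero each activation with probability `rate`; rescale the survivors."""
    if not training or rate == 0.0:
        return x                           # no change at test time
    keep = 1.0 - rate
    mask = rng.random(x.shape) < keep      # True for units that are kept
    return x * mask / keep                 # rescaling keeps the expected value

rng = np.random.default_rng(0)
x = np.ones(10000)                         # toy activations
y = dropout(x, rate=0.6, rng=rng)          # rate near the reported optimum
print(y.mean(), (y == 0).mean())
```

On average about 60% of the units are zeroed at a rate of 0.6, while the mean activation stays close to its original value because of the rescaling.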

Activation function effect

The activation function determines how the response of each filter is transformed before it is passed on, and it is therefore one of the prominent parts of a convolutional neural network. Different activation functions such as ReLU, Tanh, SoftPlus and linear can be used in convolutional neural networks, each with its own characteristics. Based on the obtained results (Table 7), the ReLU function outperformed the other activation functions and therefore it was used to train the model on the target domain.

Table 7. The effect of the activation function on the performance of the proposed model

Activation function     Accuracy %
                        AMZ-2     AMZ-5
Tanh                    91.45     57.12
Softplus                80.25     40.43
ReLU                    92.4      57.31
Linear                  90.31     56.25
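For reference, the four activation functions compared in Table 7 can be written as below. These are the standard textbook definitions; the dictionary name `ACTIVATIONS` is only an illustrative convenience and does not come from the paper.

```python
import math


def relu(x):
    """Rectified linear unit: passes positive inputs, zeroes negative ones."""
    return max(0.0, x)


def tanh(x):
    """Hyperbolic tangent: squashes inputs into (-1, 1)."""
    return math.tanh(x)


def softplus(x):
    """Smooth approximation of ReLU, log(1 + exp(x)), written in a numerically stable form."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def linear(x):
    """Identity: no nonlinearity is applied."""
    return x


ACTIVATIONS = {"ReLU": relu, "Tanh": tanh, "SoftPlus": softplus, "Linear": linear}
```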

Results

Since the proposed model of this paper is divided into two parts, the obtained results are also categorized into two parts. We first investigate the effect of employing a hierarchical attention mechanism on the convolutional neural network and then consider the impact of using transfer learning. More details are provided in the following.

Classification results

To create a baseline and provide a fair comparison between the proposed model and other state-of-the-art models, we first trained the proposed model only on the target domain, without transfer learning. The accuracy of our model compared with other existing models is provided in Table 8, which covers a wide range of deep neural networks and machine learning models. Based on the obtained results, the proposed model has slightly superior performance compared to existing models and can therefore be considered a good option for transfer learning. It must be noted that the optimal values discussed in the previous section were used to train the model, and the results of the other existing models are taken from their original papers. Moreover, the accuracies were obtained using the available test data. The ADADELTA update rule, an algorithm in the family of stochastic gradient descent, was used for optimization over shuffled mini-batches. Each experiment was repeated five times and the average results are reported.

Table 8. Results of experiments on SST1 and SST2 datasets

Model              Accuracy %
                   SST-2     SST-5
NB [26]
[26] [12] [12] [22] [18] [18] [22] [24] [24] [24] [38] [38] [27] [26] [26] [19]
Proposed model     90.57     51.31
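The ADADELTA update rule mentioned above can be sketched as follows. This is a minimal illustration on a plain parameter list, not the training code of the proposed model; the decay rate `rho = 0.95` and the constant `eps = 1e-6` are commonly used defaults and are assumptions here, since the paper does not report them. The function name `adadelta_minimize` is hypothetical.

```python
import math


def adadelta_minimize(grad_fn, x0, steps, rho=0.95, eps=1e-6):
    """Run ADADELTA for `steps` updates, starting from parameters x0."""
    x = list(x0)
    avg_sq_grad = [0.0] * len(x)   # running average of squared gradients
    avg_sq_delta = [0.0] * len(x)  # running average of squared updates
    for _ in range(steps):
        g = grad_fn(x)
        for i in range(len(x)):
            avg_sq_grad[i] = rho * avg_sq_grad[i] + (1.0 - rho) * g[i] ** 2
            # The step size is the ratio of the two RMS values, so no learning rate is set.
            delta = -(math.sqrt(avg_sq_delta[i] + eps)
                      / math.sqrt(avg_sq_grad[i] + eps)) * g[i]
            avg_sq_delta[i] = rho * avg_sq_delta[i] + (1.0 - rho) * delta ** 2
            x[i] += delta
    return x


# Toy usage: minimize f(x) = x^2, whose gradient is 2x.
x_final = adadelta_minimize(lambda p: [2.0 * p[0]], [1.0], steps=50)
```

In actual training, `grad_fn` would return the gradient of the loss on one shuffled mini-batch rather than the gradient of a fixed function.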

To obtain a broader view of the performance of the proposed model, further analysis was carried out. The aim of applying an attention mechanism on the CNN is to focus on more relevant words. Without the attention mechanism, a CNN might also work well and assign high and low weights to important and unimportant words respectively, but it would do so without considering the context. Since the importance of a particular word is highly dependent on the context, the goal of the proposed model is to capture context-dependent importance. To illustrate how the proposed model recognizes the importance of a word based on the context, the distributions of the attention weights of the words good and bad from the test split of the SST1 dataset are presented in Fig.5 (a, b). According to these distributions, the attention weights assigned to each word span the range from 0 to 1, which indicates that the proposed method captures diverse contexts and assigns context-dependent weights.

For a more comprehensive analysis, the distributions for the words good and bad are plotted according to the rating of the reviews in Fig. 6 and Fig. 7, where panels (a)-(e) correspond to the ratings 1 to 5 respectively. Fig. 6 relates to the word good and Fig. 7 to the word bad. As can be clearly seen, in reviews with rating 1, the words good and bad have the lowest and highest weights respectively. As the rating increases, the weight distribution of the word good shifts higher while that of the word bad shifts lower. This indicates that positive words such as good play a more important role in higher-rated reviews, while negative words such as bad have more effect in lower-rated reviews, and that the proposed model is able to capture the importance of words according to the context.
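The way an attention layer produces weights between 0 and 1 for each word can be sketched as below. This is a simplified illustration in the spirit of hierarchical attention networks [7], where each word representation is scored against a learned context vector and the scores are normalized with a softmax. The learned projection matrix and bias are omitted, and the function names and toy vectors are assumptions rather than details taken from the proposed model.

```python
import math


def attention_weights(word_vectors, context_vector):
    """Score each word against the context vector and normalize with a softmax."""
    scores = [sum(math.tanh(h) * u for h, u in zip(vec, context_vector))
              for vec in word_vectors]
    top = max(scores)                       # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]        # each weight lies in (0, 1); they sum to 1


def attend(word_vectors, context_vector):
    """Return the attention weights and the weighted sum of the word vectors."""
    weights = attention_weights(word_vectors, context_vector)
    dim = len(word_vectors[0])
    sentence_vector = [sum(w * vec[d] for w, vec in zip(weights, word_vectors))
                       for d in range(dim)]
    return weights, sentence_vector


# Toy usage with three 2-dimensional word representations (made-up numbers).
words = [[0.1, 0.0], [2.0, 1.5], [-1.0, 0.2]]
weights, sentence_vector = attend(words, context_vector=[1.0, 1.0])
```

Because the word representations depend on the surrounding words, the same word can receive different weights in different sentences, which is what the distributions in the figures summarize.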

Fig.5. Distribution of attention weight of the words good (a) and bad (b)

Fig.6. Distribution of attention weight of the word good according to the ratings 1 to 5 (a)-(e)

Fig.7. Distribution of attention weight of the word bad according to the ratings 1 to 5 (a)-(e)

Transfer learning results

One of the big challenges that deep neural networks are confronted with is the lack of enough labeled training data. The performance of deep neural networks is highly dependent on the amount of data, and increasing the number of training data has a significant effect on their performance. Transfer learning helps here by effectively increasing the amount of training data available to the model. Following a similar line of research, we first trained our model on the Amazon review dataset and then transferred it to the Stanford Sentiment Treebank. As previously mentioned, two different learning processes are used in the proposed model (Fig.3 and Fig.4). In the first, the model trained on the source domain is directly used for sentiment classification in the target domain; in the second, the model is incrementally trained on the target domain using the optimal values previously discussed. To compare these transfer processes, the accuracy of the proposed model without and with incremental learning in the target domain (source domain: AMZ2, target domain: SST2) is presented in Table 9. The accuracy of the model is much lower when it is directly used for classification (74.3%), which indicates that the knowledge obtained in the source domain alone is not enough to be applied in the target domain. On the other hand, when the model is incrementally trained on the target domain, the accuracy improves significantly (91.25%).

Table 9. Transfer learning performance comparison with and without incremental learning

Description                                        Accuracy (%)
Transfer learning without incremental learning     74.3
Transfer learning with incremental learning        91.25
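The difference between the two transfer processes can be illustrated with a deliberately small stand-in. The sketch below replaces the CNN with a one-dimensional logistic regression and uses made-up source and target data whose decision boundaries differ; none of the names, data or settings come from the paper. It only shows the mechanics: in direct transfer the source-trained parameters are evaluated on the target domain as they are, whereas in incremental learning training continues on the target domain starting from those parameters.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(data, w=0.0, b=0.0, lr=0.5, epochs=200):
    """Full-batch gradient descent for 1-D logistic regression, starting from (w, b)."""
    n = len(data)
    for _ in range(epochs):
        grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in data) / n
        grad_b = sum(sigmoid(w * x + b) - y for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def accuracy(data, w, b):
    """Fraction of examples whose predicted class matches the label."""
    return sum((w * x + b > 0) == (y == 1) for x, y in data) / len(data)


# Hypothetical source domain: the positive class starts at x > 0.
source = [(x, int(x > 0)) for x in (-2, -1.5, -1, -0.5, 0.5, 1, 1.5, 2)]
# Hypothetical target domain: the positive class starts at x > 1.
target = [(x, int(x > 1)) for x in (-1.5, -1, -0.5, 0.25, 0.5, 0.75, 1.5, 2, 2.5, 3)]

# Transfer without incremental learning: use the source parameters as they are.
w_src, b_src = train(source)
acc_direct = accuracy(target, w_src, b_src)

# Transfer with incremental learning: continue training on the target domain.
w_ft, b_ft = train(target, w=w_src, b=b_src, epochs=3000)
acc_incremental = accuracy(target, w_ft, b_ft)
```

In this toy setting the directly transferred model misclassifies the target examples that lie between the two decision boundaries, and continued training on the target domain moves the boundary to fit them.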

To analyze the impact of transfer learning on sentiment classification further, the results of employing transfer learning with incremental learning on all variations of source and target domains are presented in Table 10. Employing transfer learning generally enhanced the overall classification performance. Notably, AMZ5 → SST2 and AMZ2 → SST5 demonstrated the highest performance although their source and target domains have different numbers of classes. This improvement can be attributed to the large size of the source domain and to the richer embeddings, which help the model to learn contextual information better.

Table 10. Accuracy (%) of transfer learning on different variations of source and target domains

D_T                → SST-2             → SST-5
D_S                AMZ-2     AMZ-5     AMZ-2     AMZ-5
Proposed model

CONCLUSION

The contribution of this paper is twofold. Firstly, a new convolutional neural network integrated with an attention mechanism is proposed for sentiment analysis, which considers the context to determine the polarity of sentences. It can not only predict the sentiment label of a sentence but also find the informative words that contribute to the overall classification decision. The proposed model progressively forms sentence vectors by aggregating the informative word vectors obtained from the attention layer into the feature maps extracted from the convolutional layer and employs these newly generated features for classification. Secondly, the sensitivity of the proposed model to its parameters is comprehensively investigated and the effect of transfer learning is explored. According to the empirical results and to the best of our knowledge, the proposed convolutional neural network integrated with the attention mechanism significantly outperformed other existing models. Moreover, employing transfer learning greatly improved the classification accuracy. The proposed model obtained accuracies of 90.57% and 51.31% on the SST-2 and SST-5 datasets respectively without transfer learning, whereas with transfer learning the obtained accuracies were about 92.38% and 53.63% on the SST-2 and SST-5 datasets respectively. Following a similar line of research, the proposed model with transfer learning can be applied to another target domain or to other natural language processing tasks. Applying transfer learning to other deep neural networks is also worth exploring.

REFERENCES

1. Wang, R., et al., A Survey on Opinion Mining: From Stance to Product Aspect. IEEE Access, 2019: p. 41101-41124.
2. Sadr, H., M.M. Pedram, and M. Teshnehlab, Convolutional Neural Network Equipped with Attention Mechanism and Transfer Learning for Enhancing Performance of Sentiment Analysis. Journal of AI and Data Mining, 2021.
3. Sadr, H., M.M. Pedram, and M. Teshnehlab, A Robust Sentiment Analysis Method Based on Sequential Combination of Convolutional and Recursive Neural Networks. Neural Processing Letters, 2019: p. 1-17.
4. Sadr, H., M.M. Pedram, and M. Teshnehlab, Multi-View Deep Network: A Deep Model Based on Learning Features From Heterogeneous Neural Networks for Sentiment Analysis. IEEE Access, 2020: p. 86984-86997.
5. Xie, X., et al., An improved algorithm for sentiment analysis based on maximum entropy. Soft Computing, 2019(2): p. 599-611.
6. Sadr, H., M.M. Pedram, and M. Teshnehlab, Improving the Performance of Text Sentiment Analysis using Deep Convolutional Neural Network Integrated with Hierarchical Attention Layer. International Journal of Information and Communication Technology Research, 2019(3): p. 57-67.
7. Yang, Z., et al. Hierarchical attention networks for document classification. in Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2016.
8. Sadr, H., et al., Exploring the Efficiency of Topic-Based Models in Computing Semantic Relatedness of Geographic Terms. International Journal of Web Research, 2019(2): p. 23-35.
9. Zhang, Z., Y. Zou, and C. Gan, Textual sentiment analysis via three different attention convolutional neural networks and cross-modality consistent regression. Neurocomputing, 2018: p. 1407-1415.
10. Jadidinejad, A.H. and H. Sadr, Improving weak queries using local cluster analysis as a preliminary framework. Indian Journal of Science and Technology, 2015(5): p. 495-510.
11. Zhang, X., J. Zhao, and Y. LeCun. Character-level convolutional networks for text classification. in Advances in neural information processing systems. 2015.
12. Du, C. and L. Huang, Sentiment Classification Via Recurrent Convolutional Neural Networks. DEStech Transactions on Computer Science and Engineering, 2017(cii).
13. Sadr, H., R. Atani, and M. Yamaghani, The Significance of Normalization Factor of Documents to Enhance the Quality of Search in Information Retrieval Systems. International Journal of Computer Science and Network Solutions, 2014(5): p. 91-97.
14. Sadr, H., M. Nazari Solimandarabi, and M. Mirhosseini Moghadam, Categorization of Persian Detached Handwritten Letters Using Intelligent Combinations of Classifiers. Journal of Advances in Computer Research, 2017(4): p. 13-21.
15. Soleimandarabi, M.N., S.A. Mirroshandel, and H. Sadr, The Significance of Semantic Relatedness and Similarity measures in Geographic Information Science. International Journal of Computer Science and Network Solutions, 2015.
16. Dong, X. and G. De Melo. A helping hand: Transfer learning for deep sentiment analysis. in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2018.
17. Sadr, H. and M. Nazari Solimandarabi, Presentation of an efficient automatic short answer grading model based on combination of pseudo relevance feedback and semantic relatedness measures. Journal of Advances in Computer Research, 2019(2): p. 1-10.
18. Kim, Y., Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882, 2014.
19. Socher, R., et al., Semantic Compositionality through Recursive Matrix-Vector Spaces. Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, Association for Computational Linguistics, 2012.
20. AL-Smadi, M., et al., Deep Recurrent Neural Network vs. Support Vector Machine for Aspect-Based Sentiment Analysis of Arabic Hotels’ Reviews.
21. Deep Belief Networks with Feature Selection for Sentiment Classification. Uksim.Info, pp. 16, 2016.
22. Kalchbrenner, N., E. Grefenstette, and P. Blunsom, A convolutional neural network for modelling sentences. arXiv preprint arXiv:1404.2188, 2014.
23. Yin, W., et al., Abcnn: Attention-based convolutional neural network for modeling sentence pairs. arXiv preprint arXiv:1512.05193, 2015.
24. Tai, K.S., R. Socher, and C.D. Manning, Improved semantic representations from tree-structured long short-term memory networks. arXiv preprint arXiv:1503.00075, 2015.
25. Kuta, M., M. Morawiec, and J. Kitowski, Sentiment Analysis with Tree-Structured Gated Recurrent Units. Springer International Publishing AG, 2017.
26. Socher, R., et al. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. in EMNLP. 2013.
27. Wang, Y., M. Huang, and L. Zhao. Attention-based LSTM for aspect-level sentiment classification. in Proceedings of the 2016 conference on empirical methods in natural language processing. 2016.
28. Semwal, T., et al. A practitioners' guide to transfer learning for text classification using convolutional neural networks. in Proceedings of the 2018 SIAM International Conference on Data Mining. 2018. SIAM.
29. Krizhevsky, A., I. Sutskever, and G.E. Hinton. Imagenet classification with deep convolutional neural networks. in Advances in neural information processing systems. 2012.
30. Sermanet, P., et al., Overfeat: Integrated recognition, localization and detection using convolutional networks. arXiv preprint arXiv:1312.6229, 2013.
31. Sadr, H., et al. Unified Topic-Based Semantic Models: A Study in Computing the Semantic Relatedness of Geographic Terms. 2019. IEEE.
32. Mikolov, T., et al., Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
33. Pennington, J., R. Socher, and C. Manning. Glove: Global vectors for word representation. in Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). 2014.
34. Bojanowski, P., et al., Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606, 2016.
35. Ruder, S., Neural Transfer Learning for Natural Language Processing. 2019, National University of Ireland, Galway.
36. Pang, B. and L. Lee. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. in Proceedings of the 43rd annual meeting on association for computational linguistics. 2005. Association for Computational Linguistics.
37. Zhang, Y. and B. Wallace, A sensitivity analysis of (and practitioners' guide to) convolutional neural networks for sentence classification. arXiv preprint arXiv:1510.03820, 2015.
38. Kokkinos, F. and A. Potamianos,